Async Workflows and System Jobs Stop Working and Get Stuck with Status Reason "Waiting for Resources"

Today, I encountered an issue at a client site where none of the asynchronous workflows were working any more. Any new workflow, whether triggered on demand or automatically by an event, went straight into the "Waiting for Resources" status.

I tried practically everything that I had heard of as a solution for async workflow issues:
- Restarted the Microsoft Dynamics CRM Asynchronous Processing Service
- Restarted the rest of the CRM services
- Restarted the SQL Server service
- Restarted both the CRM application server and the database server
- Went through several blog posts and forum threads related to the issue

The earliest system jobs stuck in that status also happened to coincide with a server restart after the Windows security updates last Sunday, so we tried uninstalling those updates as well.
We also encountered the error "Error 1053: The service did not respond to the start or control request in a timely fashion" when we tried to restart the async service, so we even considered installing the async service on another machine as a separate server role.

Finally, after trying out different things, I found the root cause of the issue.

SYMPTOMS

Symptom 1

When you stop or restart the Microsoft Dynamics CRM Asynchronous Processing Service and the stop or restart operation takes longer than the default configured timeout, you receive the following error message:
Could not stop the Microsoft Dynamics CRM Asynchronous Processing Service on Local Computer.
Error 1053: The service did not respond to the start or control request in a timely fashion.


Symptom 2

There are a lot of warning logs in the Event Viewer of the CRM application server from the source MSCRMPlatform with the following message:
Query execution time of 30.0 seconds exceeded the threshold of 10 seconds. Thread: 85; Database: OrgName_MSCRM; Server:DBServerName; Query: 
SET DEADLOCK_PRIORITY LOW
BEGIN TRANSACTION
    DECLARE @res INT
    EXEC @res = sp_getapplock                 
                @Resource = 'AsyncService_app_lock',
                @LockMode = 'Exclusive'
    IF @res NOT IN (0, 1)
    BEGIN
        RAISERROR ( 'Unable to acquire App Lock', 16, 1 )
    END 
    ELSE
    BEGIN

update AsyncOperationBase WITH  (READPAST,READCOMMITTEDLOCK,UPDLOCK)
set
ModifiedOn = '02/20/2015 13:41:04',
ModifiedBy = CreatedBy
where 
StateCode = 2
and HostId = 'server.MSCRMAsyncService.737ea3a3-ca4d-4500-a0b0-f4b94f00e442'
EXEC @res = sp_releaseapplock 
                            @Resource = 'AsyncService_app_lock'
    END    
COMMIT
Note: OrgName is a placeholder for the name of your organisation and DBServerName is a placeholder for the name of your database server.
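The query in the warning appears to be the async service touching the jobs it currently holds in the Locked state. If you want to see how many jobs each async host is holding locked, here is a minimal sketch, assuming direct SQL access to the OrgName_MSCRM database:

 SELECT HostId, COUNT(*) AS LockedJobs
 FROM AsyncOperationBase
 WHERE StateCode = 2  -- Locked
 GROUP BY HostId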

Symptom 3

All of the async system jobs are stuck with Status Reason "Waiting for Resources", apart from a few jobs stuck with "In Progress".



CAUSE

Some of the initial async system jobs were stuck with Status Reason "In Progress" because of a SQL timeout while they were being processed. Those system jobs that hit the SQL timeout were also never transitioned to "Failed", so the rest of the system jobs in "Waiting for Resources" could not be processed.
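To get a quick picture of the pile-up on your own system, here is a rough sketch of the kind of query I mean, run directly against the organisation database (the StateCode and StatusCode values are listed in the table in the Resolution section below):

 SELECT StateCode, StatusCode, COUNT(*) AS JobCount
 FROM AsyncOperationBase
 GROUP BY StateCode, StatusCode
 ORDER BY JobCount DESC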

RESOLUTION

To resolve this problem, stop the MS CRM Async service. All of the system jobs stuck with Status Reason "In Progress" will then change back to "Waiting for Resources". After that, patch the statecode and statuscode of those system jobs to "Canceled" based on the values in the table below.
StatusCode
 0   Waiting For Resources
 10  Waiting
 20  In Progress
 21  Pausing
 22  Canceling
 30  Succeeded
 31  Failed
 32  Canceled

StateCode
 0   Ready
 1   Suspended
 2   Locked
 3   Completed

(You may update the [Message] column with a marker value so that you can identify and re-trigger those workflows at a later stage.)
Here's the SQL that I used for the patch, for your reference.
 UPDATE AsyncOperationBase
 SET StateCode = 3,                  -- Completed
     StatusCode = 32,                -- Canceled
     [Message] = 'WorkflowErrorPatched'
 WHERE StatusCode IN (0, 20)         -- Waiting For Resources, In Progress
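Before running an UPDATE like this against a production database, it is worth checking how many rows it will touch (and taking a backup first); a minimal sketch:

 SELECT COUNT(*) AS RowsToPatch
 FROM AsyncOperationBase
 WHERE StatusCode IN (0, 20)  -- Waiting For Resources, In Progress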

After that, start the MS CRM Async service again and check whether the workflows are running again. If so, update the statecode and statuscode of the system jobs that we cancelled in the earlier step back to "Waiting for Resources", batch by batch, in descending order of the [CreatedOn] column, since the earliest jobs are the ones that caused the SQL timeout issue. As each batch changes back to "Waiting for Resources", the normal jobs will be processed and eventually change to "Succeeded". When you finally change the problematic ones back to "Waiting for Resources", all of the async system jobs will get stuck again, just like in the initial state, which is how you know you have reached the culprits. A sketch of the kind of batched update I mean follows below.
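This is only a rough sketch, assuming you used the 'WorkflowErrorPatched' marker from the patch above; adjust the batch size (100 here) to your own job volume:

 -- Re-enable the most recent 100 patched jobs first, working backwards
 -- towards the oldest (likely problematic) ones.
 UPDATE A
 SET A.StateCode = 0,   -- Ready
     A.StatusCode = 0   -- Waiting For Resources
 FROM AsyncOperationBase AS A
 INNER JOIN (
     SELECT TOP (100) AsyncOperationId
     FROM AsyncOperationBase
     WHERE [Message] = 'WorkflowErrorPatched'
       AND StatusCode = 32      -- still Canceled
     ORDER BY CreatedOn DESC    -- newest first
 ) AS B ON A.AsyncOperationId = B.AsyncOperationId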

Once you are able to pinpoint the affected async system jobs, just keep them in the "Canceled" status and let the rest of the workflows complete.

MORE INFORMATION

For my issue, the affected workflows were those related to note (annotation) records with attachment files larger than 10 MB. I'm not sure what the size of the attachment file has to do with a simple On Create workflow that just updates a flag on the case entity. But at least we were able to identify the root cause of the issue, and I'll update this blog post if I get the chance to investigate further and find the link between those SQL-timeout workflows and the size of the attachments on the annotation records.
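If you want to hunt for similarly affected jobs on your own system, something along these lines might help. This is speculative: it assumes the stuck workflow jobs have the annotation as their regarding object, and that FileSize on AnnotationBase is stored in bytes:

 SELECT AO.AsyncOperationId, AO.Name, AN.FileName, AN.FileSize
 FROM AsyncOperationBase AS AO
 INNER JOIN AnnotationBase AS AN
     ON AO.RegardingObjectId = AN.AnnotationId
 WHERE AN.FileSize > 10 * 1024 * 1024  -- attachments larger than 10 MB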

Comments

  1. Hey LZW, just wanted to thank you for your write-up. I just hit this issue on a massive production server where the async jobs were piling up like crazy and we couldn't figure out why. Thanks for posting your findings!

     Reply: Glad to know that the information I shared is helpful for others as well. And thanks for leaving a comment. That motivates me to post more and more on this blog.

  2. LZW!!! You are a life-saver, my good man! Our CRM was nearly completely non-functional due to the issue you detailed above, and I could not figure out how to get workflows and events processing again. It was completely by chance that I came across your posting, and man, was I fortunate I did. You saved this DBA from pulling more hair out trying to get the issue resolved. I am forever grateful to you!

    Erik G
    Spectra Logic


