Channel: SharePoint 2013 - General Discussions and Questions forum

Workflow timer job gets stuck in the running state


Dear Community,

In December last year we encountered a problem with the workflow timer job on our non-clustered SharePoint 2013 farm, which consists of one front-end server and one application server.
The issue still isn't solved. Let me describe it in more detail, along with what we have already tried.

One day one of our departments called and told me that workflows they had started with pausing actions (like "Pause workflow for 1 minute") seemed to be stuck. So I connected to their SharePoint server and checked the state of the workflow timer job. It showed "running", and had done so for about 36 hours since its last successful run.
I thought a restart of both servers, front-end and application, would fix it. But that wasn't the case. The timer job still said "running".

After some investigation it turned out that this is not a rare problem. The most frequently mentioned suggestion is to clear the timer job cache, stopping the SharePoint Timer Service beforehand, plus some variations of that. I did this several times, and it seems to work at first, but only for 3 to 4 runs (the workflow timer job runs successfully and actually triggers the new workflow instances with pausing actions).
But then, on the 4th or 5th run of the timer job, it gets stuck in the running state again.

The problem first occurred on Dec 14th 2016 at 11:00 pm. So I personally don't think a corrupted workflow instance is the cause: if it were, the problem would most likely have occurred at some point during our usual working hours, so around midday or in the afternoon.

All the workflows are Nintex based. It should be mentioned that the Nintex timer jobs are running smoothly; it's just the default SharePoint workflow timer job that seems to cause a problem.
Together with our external support we checked the job definition of our workflow timer job, which also seems to be fine (default configuration).
We couldn't find anything in the ULS logs that points to the problem, just some lines saying "Could not run workflow timer job as it is already running" (translated from German).
One of our thoughts was also that too many workflow instances might be running, as their number has grown steadily over the last few years. So I cancelled about 300-400 very old workflow instances that no one would probably ever need again.
But nothing changed: after clearing the timer job cache, the job ran 3 times successfully, and the 4th time it got stuck on running again.
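
To see when those "already running" lines start piling up in the ULS logs, a small script can count them per hour. This is only a minimal sketch: it assumes the ULS log lines are tab-separated with the timestamp (MM/dd/yyyy HH:mm:ss) in the first column, and the function name `count_hits_per_hour` is hypothetical. Since the message above is translated from German, the search phrase is a parameter rather than a fixed string.

```python
from collections import Counter
from typing import Iterable


def count_hits_per_hour(lines: Iterable[str], phrase: str) -> Counter:
    """Count log lines containing `phrase`, grouped by date and hour.

    Assumption: each line is tab-separated and starts with a timestamp
    such as "12/14/2016 23:00:01.12", so its first 13 characters give
    the date plus the hour.
    """
    hits = Counter()
    needle = phrase.lower()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a tab-separated log line
        if needle not in line.lower():
            continue  # message does not match
        timestamp = fields[0].strip()
        hits[timestamp[:13]] += 1
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python count_hits.py <logfile> "<phrase>"
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hour, count in sorted(count_hits_per_hour(handle, sys.argv[2]).items()):
            print(hour, count)
```

The first hour with hits should roughly mark when the job got stuck, which narrows down which part of the logs to read more closely.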

We currently don't know what else we could try to get the timer job running successfully again or what to search for.
Maybe there is something in the logs we missed, maybe there is some special corrupted workflow instance (but how to find out which one?).

Our SharePoint configuration for the affected farm:
SharePoint 2013 - Foundation
RAM and CPU: I don't know exactly (currently I'm out of office), but it should be enough. I checked the task manager on both servers and none of them was even nearly overloaded.
Patch level / database version: 15.0.4841.1000 (July 2016 CU)

If you need any further information, please tell me. I'll try to answer as soon as possible.
Your help is very much appreciated!

Thanks, kind regards
Roman

Attempts to solve this issue (I'll keep this updated):

  • Deleted the file system cache on all SharePoint servers in \ProgramData\Microsoft\SharePoint\Config\{ID}\, as mentioned in many blogs and threads
  • Reset IIS and rebooted all servers
  • Let the timer jobs run on a different server
  • Cancelled old workflow instances that aren't needed anymore
  • Disabled and re-enabled the workflow timer job
  • Re-ran the Configuration Wizard on all servers in the farm
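
For reference, the cache clearing from the first item can be sketched roughly as follows. This is a minimal sketch of the procedure as it is commonly described in blogs, not an official tool: it assumes the SharePoint Timer Service has already been stopped by hand, that the XML files in the config folder are deleted while cache.ini is kept, and that cache.ini is then reset to 1. The function name `clear_config_cache` and the `dry_run` switch are hypothetical.

```python
from pathlib import Path
from typing import List


def clear_config_cache(cache_dir: Path, dry_run: bool = True) -> List[str]:
    """Delete the *.xml files in cache_dir and reset cache.ini to "1".

    Assumption: the SharePoint Timer Service is stopped before this runs
    and started again afterwards. With dry_run=True nothing is changed;
    the function only reports which files would be removed.
    """
    removed = []
    for xml_file in sorted(cache_dir.glob("*.xml")):
        removed.append(xml_file.name)
        if not dry_run:
            xml_file.unlink()
    if not dry_run:
        # Resetting the counter makes the server rebuild its cache.
        (cache_dir / "cache.ini").write_text("1")
    return removed
```

Calling it first with the default `dry_run=True` lists the affected files without touching anything; the folder to pass in is the {ID} folder named in the first item, on each server in turn.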
