Two items high on the list of concerns for many companies are backups and cost. We recently had a client looking for a cost-effective but reliable solution for backing up their internal SharePoint 2010 environment.
We modified the script to fit our client’s environment, wrapped it in a batch file that handles a few other clean-up and move-related tasks, scheduled it to run nightly, and were done. Or so we thought.
Seemingly randomly, the client would call and be locked out of one of their SharePoint sites. The fix was easy enough – open Central Admin, navigate to Application Management / Site Collections, Configure Quotas and Locks:
Select the site that is locked out, and change the status from “Read-only” to “Not Locked”.
Fine, but why is it happening?
The PowerShell script used to back up the sites puts each one into Read-only status to ensure no changes occur during the backup routine. On occasion, though, PowerShell crashes. The crash seems to happen at a random point during execution, which means any of the sites could be left locked, depending on how far the script got before it died.
Each time the crash occurred, we would see a very generic Application Log entry on the server:
Log Name: Application
Source: Windows Error Reporting
Date: 2/23/2015 10:00:56 PM
Event ID: 1001
Task Category: None
Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
Apparently we’re not the only ones seeing this problem, as it is also reported in the CodePlex discussion forums here.
Knowing this is a little out of our power to actually fix (there must be some set of circumstances during the execution of the script that occasionally causes PowerShell to crash, and I highly doubt it is at the top of Microsoft’s list of things to fix), we needed a workaround that lets the client continue working while still getting regular, usable backups.
What we want to happen moving forward is to:
- Notice that the error has occurred.
- Log which sites are in “Read-only” status.
- Unlock the sites.
- Notify the client that the crash occurred, and provide the current status of the sites.
Note: The notification is important in case the backup starts crashing on a more regular basis and backup frequency becomes an issue.
So we decided to add a little error trapping of our own.
We created 3 new scripts:
- A batch file to fire when event 1001 occurs in the Application log
- A PowerShell script to check the status of the SharePoint sites
- A PowerShell script to change the status of the SharePoint sites
The two PowerShell scripts are separate because I want to write the status of the sites to a log file before and after the change script runs, in the hope that we’ll catch a pattern that helps us determine the cause of the initial crash.
Let’s look at the 2 PowerShell scripts first:
The “Check” script simply accepts an argument (the SharePoint site name), loads the SharePoint components for PowerShell, checks the status of the site, and formats the result as a table.
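The original listing is not reproduced here, so the following is only a minimal sketch of what such a “Check” script could look like. The file name and the `$SiteUrl` parameter are hypothetical, and it assumes the site is identified by its URL; `Get-SPSite` and the `ReadOnly`, `ReadLocked`, `WriteLocked`, and `LockIssue` properties come from the standard SharePoint 2010 snap-in.

```powershell
# Check-SiteLock.ps1 (hypothetical file name)
# Example call: powershell.exe -File Check-SiteLock.ps1 -SiteUrl "http://sharepoint/sites/example"
param(
    [Parameter(Mandatory = $true)]
    [string]$SiteUrl   # hypothetical parameter name; URL of the site collection
)

# Load the SharePoint cmdlets unless the snap-in is already present in this session.
if (-not (Get-PSSnapin -Name Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue)) {
    Add-PSSnapin Microsoft.SharePoint.PowerShell
}

$site = Get-SPSite -Identity $SiteUrl
try {
    # Render the lock-related properties as a text table.
    # Out-String forces the table to be built before the site object is disposed.
    $site |
        Format-Table Url, ReadOnly, ReadLocked, WriteLocked, LockIssue -AutoSize |
        Out-String
}
finally {
    # SPSite objects hold unmanaged resources and should be released.
    $site.Dispose()
}
```

A site left in the state described above should show `ReadOnly` as `True`; once unlocked, it should show `False`.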
The “Change” script is similar. It accepts an argument (again the SharePoint site name), registers the SharePoint components for PowerShell, and changes the status of the site to “Not Locked”.
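As with the “Check” script, the original listing is not reproduced here. A minimal sketch of a “Change” script, again assuming a hypothetical file name and `$SiteUrl` parameter, might be as short as this. The `Unlock` value of `Set-SPSite -LockState` corresponds to “Not Locked” in Central Admin.

```powershell
# Unlock-Site.ps1 (hypothetical file name)
# Example call: powershell.exe -File Unlock-Site.ps1 -SiteUrl "http://sharepoint/sites/example"
param(
    [Parameter(Mandatory = $true)]
    [string]$SiteUrl   # hypothetical parameter name; URL of the site collection
)

# Load the SharePoint cmdlets unless the snap-in is already present in this session.
if (-not (Get-PSSnapin -Name Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue)) {
    Add-PSSnapin Microsoft.SharePoint.PowerShell
}

# Clear any lock on the site collection ("Not Locked" in Central Admin).
Set-SPSite -Identity $SiteUrl -LockState Unlock
```

Setting the lock state on a site that is already unlocked should be harmless, which is why the workaround can run it against every site without first working out which one the crash left locked.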
With those two scripts in place, we can now use a batch file to cycle through the sites, gather information, and unlock them:
Note: In this case our client has 4 sites on their server. We could make this script more generic by reading the list of sites and cycling through them dynamically, but since we only have one client with this issue right now, and we were pressed for time to get it in place, we stopped here. If you are going to use this script, you can simply add or remove sites and their references within the script. The yellow highlighted text would need to be changed for another environment.
Note: In case you’re not familiar, Blat is a small command-line SMTP program we often use; it has been downloaded to this SharePoint server to send the email notification.
In a nutshell, the above script loops through itself 3 times, completing the following:
- Documenting the current status of the SharePoint Sites.
- Unlocking the SharePoint Sites.
- Re-documenting the current status of the SharePoint sites.
That information is then emailed to a group of people who can review it and take action if necessary.
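For readers who would rather keep everything in PowerShell, the same three passes plus the notification can be sketched as below. This is not the batch file described in this post: it swaps Blat for the built-in `Send-MailMessage` cmdlet, and the site URLs, script file names, `-SiteUrl` parameter, SMTP server, and addresses are all placeholders to replace for a real environment.

```powershell
# Unlock-AllSites.ps1 (hypothetical) - a PowerShell stand-in for the batch file.
# Assumes the Check and Change scripts sit in the same folder and accept -SiteUrl.

$sites = @(
    'http://sharepoint/sites/site1',   # placeholder URLs; list your own sites here
    'http://sharepoint/sites/site2',
    'http://sharepoint/sites/site3',
    'http://sharepoint/sites/site4'
)

$scriptDir   = Split-Path -Parent $MyInvocation.MyCommand.Path
$checkScript = Join-Path $scriptDir 'Check-SiteLock.ps1'   # hypothetical name
$unlockScript = Join-Path $scriptDir 'Unlock-Site.ps1'     # hypothetical name
$logFile     = Join-Path $scriptDir 'LockState.txt'

# Pass 1: document the current status of every site (starts a fresh log file).
"Lock state BEFORE unlock - $(Get-Date)" | Out-File $logFile
foreach ($url in $sites) {
    & $checkScript -SiteUrl $url | Out-File $logFile -Append
}

# Pass 2: unlock every site.
foreach ($url in $sites) {
    & $unlockScript -SiteUrl $url
}

# Pass 3: document the status again so the log shows the effect of the change.
"Lock state AFTER unlock - $(Get-Date)" | Out-File $logFile -Append
foreach ($url in $sites) {
    & $checkScript -SiteUrl $url | Out-File $logFile -Append
}

# Email the log. Server and addresses are placeholders.
$body = (Get-Content $logFile) -join "`r`n"
Send-MailMessage -SmtpServer 'smtp.example.com' `
    -From 'sharepoint@example.com' -To 'admins@example.com' `
    -Subject 'SharePoint backup crash: sites unlocked' -Body $body
```

Keeping the before and after snapshots in one file means the email alone shows which sites were locked when the crash happened.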
The email that arrives for the notification looks like this (and is based on the “C:localscriptLockState.txt” file generated by the batch file):
Finally, in order to get this script to function, a new Task in Task Scheduler is created:
With the following trigger:
Begin the Task: On an event
Source: Windows Error Reporting
Event ID: 1001
With the following action:
Action: Start a Program
This script will now be executed if the Application Log reports event ID 1001 from Windows Error Reporting.
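The same task can probably also be created from the command line rather than through the Task Scheduler UI. The sketch below uses `schtasks.exe` with an event trigger; the task name and the path to the batch file are placeholders, and the XPath filter is an assumption based on the trigger settings listed above. Run it from an elevated prompt, and add the `/RU` option if the task must run under a specific account.

```powershell
# Create an event-triggered task (task name and script path are placeholders).
# /SC ONEVENT  - trigger on an event rather than on a schedule
# /EC          - the event log channel to watch
# /MO          - XPath query selecting the triggering event
schtasks.exe /Create /TN 'SharePoint-UnlockSites' `
    /TR 'C:\Scripts\UnlockSites.bat' `
    /SC ONEVENT /EC Application `
    /MO "*[System[Provider[@Name='Windows Error Reporting'] and EventID=1001]]"
```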
We may end up taking this script one step further and have it re-invoke the backup script at the end of its routine. For now, we are not doing that because we are still collecting information on frequency and hoping to find the root cause. Automating another call to the backup routine could also put us into an endless loop if the crash starts happening frequently.
Also, the Event ID we’re using to trigger the batch file can be logged for other application crashes, which could potentially start the backup at a time that would be inconvenient.
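One way to reduce that risk would be a guard at the top of whatever the task launches, so it only proceeds when the most recent event 1001 looks like a PowerShell crash. This is a sketch only: it assumes the event message names the faulting application (Windows Error Reporting normally lists it as `P1` in the problem signature), and the five-minute window is an arbitrary choice.

```powershell
# Fetch the most recent Windows Error Reporting event 1001 from the Application log.
$lastCrash = Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'Windows Error Reporting'
    Id           = 1001
} -MaxEvents 1 -ErrorAction SilentlyContinue

# Proceed only if the crash is recent and the message mentions PowerShell.
$isRecent     = $lastCrash -and ($lastCrash.TimeCreated -gt (Get-Date).AddMinutes(-5))
$isPowerShell = $lastCrash -and ($lastCrash.Message -match 'powershell')

if (-not ($isRecent -and $isPowerShell)) {
    Write-Output 'Latest event 1001 is not a recent PowerShell crash; nothing to do.'
    exit 0
}

# ...continue with the check / unlock / notify steps here...
```

A check like this does not distinguish the backup script from any other PowerShell session that crashes, so it narrows the trigger rather than making it exact.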
Although I do not consider this a permanent solution, it does give the client a way to continue operating without manual intervention when the crash occurs. We’re currently seeing the crash about once every 4 to 6 weeks, and for our client this is a suitable workaround.
Hopefully this can assist others experiencing the same issue.