Most common deployment slot swap failures and how to fix them

Azure Web App deployment slots are used to deploy new versions of application code into production without interrupting production traffic. To achieve this, the swap process performs multiple steps that prepare the new version of the code to handle the load once it is in the production slot. Some of these steps may go wrong, especially when the new version of the code does not cooperate well. The result is either a failed swap, or new code swapped into production while it is still not ready to handle the production load. This post describes the most common reasons why this happens and how to correct them.

To better understand the reasons for swap failures, it is first necessary to explain how the application code in the staging slot is initialized (warmed up) prior to the swap to production. Failures during these steps are the most common reasons for the overall failure of the swap operation.

The swap operation is done by an internal process that runs within the scale unit where the web app is hosted. Here are the steps it performs to ensure the application is initialized prior to the swap. Note that the same sequence of actions happens during Auto-Swap and Swap with Preview.

  • Apply the production configuration settings to all of the web app’s instances in the staging slot. This happens when the web app has app settings or connection strings marked as “Slot settings”, or if Continuous Deployment is enabled for the site, or if Site Authentication is enabled. This triggers all instances in the staging slot to restart. (For Swap with Preview this is the first phase of the swap, after which the swap process is paused and you can validate that the application works correctly with production settings.)
  • Wait for every instance to complete its restart. If some instance fails to restart, the swap process reverts any configuration changes to the app in the staging slot and does not proceed further. If that happens, the first place to look is the D:\home\LogFiles\eventlog.xml file or the application-specific error log (such as php_errors.log for PHP apps), where you may find more clues about what prevents the application from starting.
  • If Local Cache is enabled, the swap process triggers Local Cache initialization by making an HTTP request to the root URL path (“/”) of the web app on every web worker. Local Cache initialization consists of copying the site’s content files from the network share to the local disk of the worker and then re-pointing the web app to use the local disk for its content. This causes another restart of the web app. The swap process waits until the Local Cache is completely initialized and the app has restarted on every instance before proceeding further. A common reason why Local Cache initialization fails is that the site content size exceeds the local disk quota specified for the Local Cache. If that is the case, the quota can be increased by following the instructions in the Local Cache documentation.
  • If Application Initialization (AppInit) is enabled, the swap process makes another HTTP request to the root URL path on every web worker. AppInit is a module that runs within the web app’s request processing pipeline and gets executed when the web app starts. The only thing the swap process does with its first HTTP request to the web app is trigger the AppInit module to do its work. After that it just waits until AppInit reports that it has completed the warmup. The AppInit module takes the list of URL paths specified inside the web.config file and makes an internal HTTP request to each of them. All these requests stay within the web app process: AppInit does not call any external URLs, and its requests do not go through the scale unit’s front ends. Also, neither the initial HTTP request nor the AppInit internal requests follow HTTP redirects. That causes the most common problem that users run into with this module. If the web app has rewrite rules such as “Enforce Domain” or “Enforce HTTPS”, then none of the warmup requests will actually reach the application code, because all of them will be short-circuited by the rewrite rules. To prevent that, the rewrite rules need to be modified as shown below:
<rule name="Canonical Host Name" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <!-- Do not redirect requests made by the AppInit module -->
    <add input="{WARMUP_REQUEST}" pattern="1" negate="true" />
    <!-- Do not redirect clients with scale unit internal addresses (10.* or 100.*) -->
    <add input="{REMOTE_ADDR}" pattern="^100?\." negate="true" />
    <add input="{HTTP_HOST}" negate="true" pattern="^ruslany\.net$" />
  </conditions>
  <action type="Redirect" url="http://ruslany.net/{R:1}" redirectType="Permanent" />
</rule>
<rule name="Redirect to HTTPS" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <!-- Same two exclusions as above, so warmup requests reach the application code -->
    <add input="{WARMUP_REQUEST}" pattern="1" negate="true" />
    <add input="{REMOTE_ADDR}" pattern="^100?\." negate="true" />
    <add input="{HTTPS}" pattern="^OFF$" />
  </conditions>
  <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent" />
</rule>

The {WARMUP_REQUEST} server variable is set by the AppInit module for each of its internal requests, which makes it a reliable way to tell whether a request is external or was made by the AppInit module. The {REMOTE_ADDR} server variable contains the IP address of the HTTP client. Addresses starting with “10.” or “100.” (the ones the ^100?\. pattern matches) are internal to the scale unit, and no outside HTTP client can use them.
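
The list of URL paths that AppInit requests, mentioned in the step above, is declared in the applicationInitialization section of web.config. Here is a minimal sketch of what that section might look like. The two paths are hypothetical placeholders, not taken from any real application, and should be replaced with paths that exercise your own startup code:

```xml
<system.webServer>
  <applicationInitialization>
    <!-- Hypothetical warmup paths, used here only for illustration -->
    <add initializationPage="/warmup-cache" />
    <add initializationPage="/warmup-db" />
  </applicationInitialization>
</system.webServer>
```

Since AppInit does not follow redirects, each listed path should return its response directly rather than redirecting somewhere else.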

  • If AppInit is not enabled, the swap process just makes an HTTP request to the root path of the web app on each web worker, and as long as it receives some HTTP response it considers the warmup complete. Again, the rewrite rules in the web app can cause the site to return an HTTP redirect response, and the actual application code will not be executed at all. Since AppInit is not involved here, the only way to keep the rewrite rules from catching this request is to use the {REMOTE_ADDR} server variable in the rule’s conditions as shown below.
<conditions>
  <add input="{REMOTE_ADDR}" pattern="^100?\." negate="true" />
</conditions>
  • After all the above steps are completed successfully, the actual swap is performed by switching the routing rules on the scale unit’s front ends. More details on what happens during the swap can be found in another blog post, “How to warm up Azure Web App during deployment slots swap”.

Some other common problems that cause the swap to fail:

  • An HTTP request to the root URL path times out. The swap process waits 90 seconds for the request to return. On timeout the request is retried up to 5 times. If the request still times out after that, the swap operation is aborted. If that happens, check eventlog.xml or the application-specific error log for any indication of what causes the timeout.
  • An HTTP request to the root URL path is aborted. This may happen if web app has a rewrite rule to abort some requests. If that is the case then the rule can be modified by adding the {REMOTE_ADDR} check as shown in previous examples.
  • The web app has IP restriction rules that prevent the swap process from connecting to it. In that case you’ll need to allow the internal IP address ranges used by the swap process (the addresses starting with “10.” or “100.” described above).
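
Going back to the aborted-request case in the list above: the following is a sketch of an abort rule that leaves the swap process alone. The rule name and the www.example.com host are hypothetical, and the rule assumes the same {REMOTE_ADDR} check used in the earlier examples:

```xml
<rule name="Abort unknown hosts" stopProcessing="true">
  <match url="(.*)" />
  <conditions>
    <!-- Never abort requests that come from scale unit internal addresses -->
    <add input="{REMOTE_ADDR}" pattern="^100?\." negate="true" />
    <!-- Hypothetical condition: abort anything not addressed to the expected host -->
    <add input="{HTTP_HOST}" pattern="^www\.example\.com$" negate="true" />
  </conditions>
  <action type="AbortRequest" />
</rule>
```

With the {REMOTE_ADDR} condition in place, the request that the swap process makes to the root URL path is no longer matched by the rule and reaches the application.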

15 thoughts on “Most common deployment slot swap failures and how to fix them”

  1. The behaviour we see in the Azure console is not consistent.

    We have a NodeJS web app and when we swap dev and prod slots the Notifications continue to show that it is swapping, but if we go to the Activity Log it shows that it completed 15 minutes ago (or whatever interval has passed).

    The notifications not showing the latest status causes some concern.

  2. Hi Rob, thanks for reporting this. Yes, I’ve seen a few reports that the notification is not updated when the swap completes or fails. We’ll investigate what is going on. Meanwhile, using an activity log is a reliable way to figure out the status of the swap operation.

  3. Does Swap with Preview work with Site Authentication? You mention Site Authentication above in #1, but attempts to swap a test site from staging to production via PowerShell result in “Swap with Preview cannot be used when one of the slots has site authentication enabled”. It seems like it should work… as it should treat the Site Authentication/Authorization settings as “slot” settings and keep them where they are, only swapping the code and other App Settings.

  4. Hi Ian, the swap with preview does not work with site auth. This is because during the first stage of the swap the production site auth settings are applied to the staging slot but the slot still has the staging host name. So all the site auth redirection logic gets broken because of that.

  5. Hi, this article is still showing up in IIS Manager’s “IIS News” section. You should tell your contacts it’s still stuck there and should be untagged as IIS news.

  6. Hi,

    I’m trying to understand exactly what happens with the “sticky to slot” control.

    Here’s my current understanding.

    A slot contains a set of application files. Two slots can contain slightly different versions of the application files, and can have different connection strings.

    During a deployment slot swap the files aren’t copied, it’s just the DNS’s that are pointed differently.

    So the term “production slot” really means the set of files that the DNS points to, even if the actual physical slot that is pointed to (i.e. set of files) is changed.

    My question is, when a connection string is made “sticky to the production slot”, and the “swap” occurs, what exactly happens to the production connection string data (or app.config data, etc.)?

    1) Is that data copied to the “new” production slot, like a real file copy? (it seems like that would ‘confuse’ what was actually in the two different file sets)
    2) Is it “pointed to” from the “new” production slot (it does not seem that would be reliable, because the “old” production slot could be edited, throwing off the function of the “new” slot)
    3) Does Azure manage a pointer to the connection string for the production slot that is separate from the app code files?
    3.1) If so, how does it do that? Does it keep a separate copy of the connection string data? If so, how does Azure update that data if the “real” connection string data (which might be encrypted, etc.) changes?

    Any help with this would be appreciated.

    Thanks!

  7. Hi,
    An enlightening post thanks!

    I was wondering with the release of the new ip restrictions on the front end worker (if I understand the feature correctly), do you still need to do your last point?

    “Web App has IP restriction rules that prevent the swap process from connecting to it. In that case you’ll need to allow the internal IP address range used by the swap process”

    thanks
    Alex.

  8. Hi,

    If I configure Local Cache, will the staging slot restart twice?
    The first time when the app settings copied from the production slot are applied, and the second time after the site’s data is copied to the local cache?

    Thank you.

    1. The warmup requests are sent to the web workers from different VMs, and those VMs are on the same internal network. That’s why their IP addresses start with 10 or 100. I haven’t heard about the need to add the loopback address. I wonder if this is maybe specific to the Sitecore-based applications only?

  9. I set up slot deployment, and everything works great except that after the swap of staging and production slots, all the traffic appears to go to the staging endpoint even though 100% is set to go to the production slot. Both staging and production service instances log to the same App Insights instance, where I can see this issue. An App Service restart mitigates this issue and forces the traffic to go to the production slot until the next deployment, when another slot swap happens. How can I troubleshoot and fix this issue? Thanks!

  10. I’ve been working with Microsoft techs for over a week trying to get a new version of my code published to a deployment slot. All we get is errors. This is critical to our business since information on the site is invalid and fines from the EPA are happening.

    Here is the TrackingID#2205040040004153

    I am so fed up with trying to get this work, I’ve investigated moving to AWS.
