Question

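We have implemented AppFabric Windows Server Cache for our web application. Initially, we were able to use the cache without any problems. We then increased traffic by roughly 100x and started seeing intermittent exceptions. The exceptions occur about once every 2 days and last for about a minute each time.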

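Our configuration:

9 web servers inserting objects into and retrieving objects from the cache:
- Mostly transient, 500-byte operation-type objects
- Using 1 named region
- Objects stored with tags
- Bulk retrieval by a given tag
Cache cluster:
- 1 host (lead) running AppFabric 1.1 (Get-CacheHost reports version 3)
- SQL configuration provider
- 96GB RAM on the host, with the default 50% (48GB) allocated to AppFabric
- Cache host config
- Cache client config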

The errors occur in sequence (each of the nine web servers throws the exceptions within the same 1-minute window):

System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:15:00'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host --- End of inner exception stack trace --- at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.FramingDuplexSessionChannel.EndReceive(IAsyncResult result) at Microsoft.ApplicationServer.Caching.WcfClientChannel.CompleteProcessing(IAsyncResult result) --- End of inner exception stack trace --- at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext() at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext() at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out. at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody) at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more) at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext() at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext() at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext() at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection) at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)

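We also created a tracelog session on the cache server to capture more information for diagnosing the problem. Any suggestions on how to analyze it would be appreciated (I can provide the log if needed).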

We also monitored various AppFabric, CLR, and network performance counters. Below is a screenshot taken during an incident:

AppFabric Perfmon Capture

Thanks in advance for any ideas or suggestions you can share on resolving this issue.

Update 1

Below are the exceptions that occur back to back on the AppFabric cache server during the intermittent errors (extracted from the tracelogs):

System.ServiceModel.CommunicationException: The socket connection was aborted because an asynchronous send to the socket did not complete within the allotted timeout of 00:00:00.0082078. The time allotted to this operation may have been a portion of a longer timeout. ---> System.ObjectDisposedException: The socket connection has been disposed. Object name: 'System.ServiceModel.Channels.SocketConnection'. --- End of inner exception stack trace --- at System.ServiceModel.Channels.SocketConnection.ThrowIfNotOpen() at System.ServiceModel.Channels.SocketConnection.BeginRead(Int32 offset, Int32 size, TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SessionConnectionReader.BeginReceive(TimeSpan timeout, WaitCallback callback, Object state) at System.ServiceModel.Channels.SynchronizedMessageSource.ReceiveAsyncResult.PerformOperation(TimeSpan timeout) at System.ServiceModel.Channels.SynchronizedMessageSource.SynchronizedAsyncResult'1..ctor(SynchronizedMessageSource syncSource, TimeSpan timeout, AsyncCallback callback, Object state) at System.ServiceModel.Channels.FramingDuplexSessionChannel.BeginReceive(TimeSpan timeout, AsyncCallback callback, Object state) at Microsoft.ApplicationServer.Caching.WcfServerChannel.CompleteProcessing(IAsyncResult result)
System.ServiceModel.CommunicationObjectAbortedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it has been Aborted. at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnEndSend(IAsyncResult result) at Microsoft.ApplicationServer.Caching.ReplyContext.EndSend(IAsyncResult result)
System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state. at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen() at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)
System.TimeoutException: Sending to via http://www.w3.org/2005/08/addressing/anonymous timed out after 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: Cannot claim lock within the allotted timeout of 00:00:15. The time allotted to this operation may have been a portion of a longer timeout. --- End of inner exception stack trace --- at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnSend(Message message, TimeSpan timeout) at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout) at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)

Update 2

After a day of troubleshooting, we took the following steps, which made some progress:

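- Following this and this, we increased maxConnectionsToServer to 3. As a result, client requests per second, as recorded by the AppFabric Caching:Cache performance counter, increased by 50%, but the intermittent errors did not stop.
- We increased maxBufferSize and maxBufferPoolSize to 2147483647 (Int32.MaxValue) in the cache server configuration. So far, we have been able to handle 300x traffic without errors. We will continue to increase traffic and monitor. More updates to follow.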
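For readers who configure the client in code rather than in the dataCacheClient config section, the connection-count change can be sketched as follows. This is a minimal sketch, assuming the AppFabric client assemblies (Microsoft.ApplicationServer.Caching) are referenced; the host name `cachehost01` is a placeholder, not a value from our setup. The buffer sizes were changed in the server's cluster configuration and are not part of this client-side sketch.

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

public static class CacheClientSetup
{
    // Builds a DataCacheFactory that may open up to three channels per cache host.
    public static DataCacheFactory CreateFactory()
    {
        var config = new DataCacheFactoryConfiguration
        {
            // Placeholder endpoint; 22233 is the default AppFabric cache port.
            Servers = new List<DataCacheServerEndpoint>
            {
                new DataCacheServerEndpoint("cachehost01", 22233)
            },
            // The default is a single shared connection per cache host.
            MaxConnectionsToServer = 3
        };
        return new DataCacheFactory(config);
    }
}
```

The factory is expensive to create, so it is normally built once per process and reused for every `GetCache` call.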

Update 3

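We added two more hosts to the cluster, each with 16GB of RAM, and enabled HighAvailability mode (via Secondaries=1). For now, the original host with 96GB RAM remains in the cluster. All hosts have cacheSize = 12 GB. On the cache clients, we increased maxConnectionsToServer to 12 (1 per core). Here are our findings: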

Occasionally (once or twice every 10 minutes) we get:
- ErrorCode<ERRCA0017>:SubStatus<ES0005>:There is a temporary failure. Please retry later. (There was a contention on the store.)
- ErrorCode<ERRCA0017>:SubStatus<ES0004>:There is a temporary failure. Please retry later. (Replication queue was full. This may happen during reconfiguration of cluster hosts.)
As before, the original 96GB cache host still experiences the 1-minute outages. The new cache hosts do not experience them.

We plan to remove 80GB of RAM from the original cache host. More updates to follow.

Update 4

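Reducing the RAM in the cache host to 16GB appears to have resolved the issue. We no longer see the intermittent errors, even with traffic increased to 400x. This one appears to be closed. Now on to the next issue: High Availability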

Answer 1

Have you installed http://support.microsoft.com/kb/983182 and http://support.microsoft.com/kb/2527387?
In your code, do you catch the exception and check whether it can be retried?
```
catch (DataCacheException ex2)
{
    // RetryLater corresponds to ERRCA0017, a temporary failure.
    if (ex2.ErrorCode == DataCacheErrorCode.RetryLater)
    {
        // ... retry the operation here
    }
}
```
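A fuller version of that check, wrapped around the tag query from the question's stack traces, might look like the sketch below. It assumes the AppFabric client assemblies are referenced; the class name, method name, attempt count, and delays are illustrative choices, not recommendations from Microsoft. Treating `Timeout` (ERRCA0018) as retryable is also an assumption on my part.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using Microsoft.ApplicationServer.Caching;

public static class TaggedReads
{
    // Reads all objects carrying `tag` from `region`, retrying transient failures.
    public static List<KeyValuePair<string, object>> GetByTagWithRetry(
        DataCache cache, string region, string tag, int maxAttempts = 3)
    {
        var cacheTag = new DataCacheTag(tag);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                // ToList() drives the batched enumeration, so failures surface here.
                return cache.GetObjectsByTag(cacheTag, region).ToList();
            }
            catch (DataCacheException ex)
            {
                bool transient = ex.ErrorCode == DataCacheErrorCode.RetryLater
                              || ex.ErrorCode == DataCacheErrorCode.Timeout;
                if (!transient || attempt >= maxAttempts)
                    throw;
                // Simple linear backoff; the delay values are arbitrary.
                Thread.Sleep(TimeSpan.FromMilliseconds(200 * attempt));
            }
        }
    }
}
```

Note that each retry restarts the whole enumeration, so a caller never sees a partially filled list.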
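Using a named region forces the cluster to keep all of that region's values on a single server instead of spreading them by hash across all cache servers. ("To provide this added search capability, objects within a region are limited to a single cache host." http://msdn.microsoft.com/en-us/library/ee790985(v=azure.10).aspx)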

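I suggest sharding the named region across two more servers and adding them to the cluster. That way, when GC runs and tries to find more RAM to place and store the objects and tags, the exceptions are confined to a smaller server.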
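One way to shard is to replace the single region with several and pick one per key. The helper below is a hypothetical sketch, not part of the AppFabric API: `RegionFor`, the `ops_` prefix, and the choice of FNV-1a are all my own assumptions. Each region would still have to be created up front (for example with `DataCache.CreateRegion`), the cluster decides which host owns each region, and a tag query would have to be run against every region and the results merged.

```csharp
using System.Text;

public static class RegionSharding
{
    // Maps a cache key to one of `regionCount` region names, e.g. "ops_0" .. "ops_3".
    public static string RegionFor(string key, int regionCount, string prefix = "ops_")
    {
        // FNV-1a over the UTF-8 bytes gives the same result on every machine
        // and process, which string.GetHashCode() does not guarantee.
        uint hash = 2166136261;
        unchecked
        {
            foreach (byte b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619;
            }
        }
        return prefix + (hash % (uint)regionCount);
    }
}
```

All nine web servers must use the same `regionCount`, otherwise reads and writes for the same key land in different regions.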

Answer 2

Reposting the answer given by Jeff-ITGuy on social.msdn.microsoft.com:

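The issue you are experiencing looks nearly identical to one I am currently working on with Microsoft. If it is the same issue, the likely cause is that GC takes a long time and delays AppFabric's responses. In your performance counters, the time spent in GC spikes right when the problems start, so it may well be the same issue.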

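Microsoft is actively investigating this. In the meantime, to mitigate the problem (at least based on our findings), you can run more servers with less memory each (shrinking the memory space the GC has to work over), and you can increase RequestTimeout on the client. It defaults to 15000 (15 seconds), but we have tried raising it to 30000, which helped eliminate some of the problems. In my opinion this is not a good long-term solution; I am just passing the information along. I have seen the problem on servers with only 24GB of memory (12GB for cache), and when we tried 8GB servers with 4GB set aside for cache, things got noticeably better, because the GC coped much better.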

Hope that helps, but if this is the problem, I don't think there is a fix yet.

It did help: after we reduced the cache host RAM to 16GB, the intermittent errors stopped.
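The RequestTimeout increase mentioned in this answer can be set in the client configuration file or in code. The following is a minimal sketch of the in-code variant, assuming the AppFabric client assemblies are referenced; the class and method names are illustrative only.

```csharp
using System;
using Microsoft.ApplicationServer.Caching;

public static class CacheTimeouts
{
    // Returns a client configuration with the request timeout raised to 30 seconds.
    public static DataCacheFactoryConfiguration WithLongerTimeout()
    {
        var config = new DataCacheFactoryConfiguration();
        // The default is 15 seconds; 30 seconds is the value tried in the answer above.
        config.RequestTimeout = TimeSpan.FromSeconds(30);
        return config;
    }
}
```

A longer timeout only hides pauses shorter than the new limit, so it complements, rather than replaces, reducing the memory each cache host has to collect.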

Answer 3

A fix for this issue is now available here: http://support.microsoft.com/kb/2787717

Troubleshooting AppFabric scaling issue (intermittent ErrorCode<ERRCA0017>:SubStatus<ES0006> errors)
