L2 caching problem, MSMQ

14 posts, 0 answers
  1. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 12 Sep 2014 Link to this post

    Hi there!

    I have problems using the L2 cache of Open Access in a distributed environment. First I describe the environment then the problem.

    Open Access Version: 2014.2.617.2
    Participating applications:

    A .NET 4.5 console application started by a scheduled task with the System-Account
    A ASP.Net MVC application running under an IIS-User ("IIS AppPool\Nareg")
    A WCF Webservice application running in IIS under a IIS-User ("IIS AppPool\NaregWebservice")

    The console application is started once a day and running about 10min. The IIS-applications are running all day and are recycled after 24hours.

    The L2-cache configuration is the same for all applications:

    <openAccessConfiguration>
      <backendConfiguration name="SecondLevelCacheConfiguration">
        <secondLevelCache enabled="true" numberOfObjects="1000" cacheQueryResults="true" numberOfQueryResults="2000" numberOfObjectsPerQueryResults="550">
          <synchronization enabled="true" name="MSMQ" localpath="nareg1;nareg2;nareg3" multicastAddress="224.1.1.1:666" app="1" expirationMSec="3000" statusDurationMSec="4000" initialMessageTimeoutMsec="1000">
          </synchronization>
        </secondLevelCache>
      </backendConfiguration>
    </openAccessConfiguration>

    The problem:
    Most of the time the applications are running fine and the cache seems to be working. Changed entities in one application will be refetched from the database by the other applications.
    But the console application sometimes has problems to startup. The stacktrace:
    OpenAccessRuntime.MSMQClusterTransport.Telerik.OpenAccess.Cluster.OpenAccessClusterTransport.Init(OpenAccessClusterMsgHandler msgHandler, String _serverName, String ident, IOpenAccessClusterTransportLog log)
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.createStorageCache()
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.init()
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryImp.init()
       bei OpenAccessRuntime.DataObjects.BootstrapPMF.getPersistenceManagerFactory(PropertySet props)
       bei OpenAccessRuntime.Helper.getPersistenceManagerFactory(PropertySet props)
       bei OpenAccessRuntime.DatabaseAdapter.AssertPersistenceManagerFactory(String usr, String password, Boolean open)
       bei OpenAccessRuntime.DatabaseAdapter.GetObjectScope(TransactionProvider provider)
       bei Telerik.OpenAccess.Database.GetObjectScope(TransactionProvider provider)
       bei Telerik.OpenAccess.OpenAccessContextBase.GetScope()
       bei Telerik.OpenAccess.OpenAccessContext.GetAllCore[T]()
       bei Telerik.OpenAccess.OpenAccessContext.GetAll[T]()
       bei Nareg.Data.OpenAccess.Repository.Repository`1.FindAll(Expression`1 predicate) in ...
    Message: Initial check message was not received in 1000 milliseconds, which indicates a non-working multicast network. Parameters were: MulticastAddr: 224.1.1.1:666, Localpath: nareg1;nareg2;nareg3

    So my question: Is my setup correct? I ask because I found a config parameter "AdministrationQueue" - should I setup this parameter?
    What else could lead to this problem?

    Thanks a lot!


  2. Thomas
    Admin
    Thomas avatar
    590 posts

    Posted 16 Sep 2014 Link to this post

    Hello Daniel,

    no, I think the administration queue will not help. The failure occurs, because the following happens:
    In order to test the setup upon database open, we place an initial message in the message queue that is specified, and try to receive this message intra-process. If this message does not arrive in a certain amount
    of time, we deem the setup broken. This is what happens on your side after one second.
    I would try to increase the initialMessageTimeoutMsec to a value of 5000. The worst thing that can happen is that the attempt to open the database is taking much longer (up to 5 seconds) in the case that the setup is (for some unknown reason) invalid. But this is an error case anyways. Chances are, the message gets received after 1.2 seconds or so and all is fine. 

    Regards,
    Thomas
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  3. DevCraft banner
  4. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 24 Sep 2014 in reply to Thomas Link to this post

    Hi Thomas

    Thanks for your answer. I increased the timeout to 10000ms in every application. But today the error occurred again. Twice in a row - the third application start of the console application went ok.

    What else could lead to this error? Do the applications share one queue? Does every application need full access on every queue? Is there some kind of fallback and we are running out of queues?

    In short: It would be nice if you could elaborate a bit the mechanism of this caching.

    Thank you and greetings!
  5. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 25 Sep 2014 Link to this post

    Hi Daniel,
    You might find this blog post interesting where the behavior of the 2nd level cache in web applications is explained.

    The recommendations based on the provided details is to have 5 queues. 2 for each process in IIS and one for the console application. You need 2 queues for a IIS application because of the overlapped recycling mechanism. Here the queues are not freed up before the new process is started and therefore you are getting troubles accessing the defined queues if you have only one queue available per IIS process.

    Additionally please take care that you are disposing the database object in your Application_End method of your web application. The reason is, that otherwise the queue handles will not be freed until you restart your machine and thus will cause additional issues.

    I hope this information is helpful for you.
    Do come back in case you need further assistance.

    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  6. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 25 Nov 2014 in reply to Ralph Waldenmaier Link to this post

    Hi Ralph

    Thanks for your answer and suggestions. Now my setup is as you described. The console application has its own queue and closes the queue also upon exit/exception with database-dispose.
    Although I still having problems. The exception still occurs. I also have set the BindInterfaceIP and ClusterBindIP IP in the registry, as the server has two (virtual) network interfaces (vmxnet3-Adapters).
    Do I have a chance to trace the problem down to its cause? It's only the startup behaviour - once the processes are running, the 2nd Level cache works as expected.
    I'm stuck right now, do you have any suggestions?

    The exception:
    Stacktrace:    bei OpenAccessRuntime.MSMQClusterTransport.Telerik.OpenAccess.Cluster.OpenAccessClusterTransport.Init(OpenAccessClusterMsgHandler msgHandler, String _serverName, String ident, IOpenAccessClusterTransportLog log)
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.createStorageCache()
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.init()
       bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryImp.init()
       bei OpenAccessRuntime.DataObjects.BootstrapPMF.getPersistenceManagerFactory(PropertySet props)
       bei OpenAccessRuntime.Helper.getPersistenceManagerFactory(PropertySet props)
       bei OpenAccessRuntime.DatabaseAdapter.AssertPersistenceManagerFactory(String usr, String password, Boolean open)
       bei OpenAccessRuntime.DatabaseAdapter.GetObjectScope(TransactionProvider provider)
       bei Telerik.OpenAccess.Database.GetObjectScope(TransactionProvider provider)
       bei Telerik.OpenAccess.OpenAccessContextBase.GetScope()
       bei Telerik.OpenAccess.OpenAccessContext.GetAllCore[T]()
       bei Telerik.OpenAccess.OpenAccessContext.GetAll[T]()
       bei Nareg.Data.OpenAccess.Repository.Repository`1.FindAll(Expression`1 predicate) in c:\Daten\Source\gdk\Nareg\Nareg.Data.OpenAccess\Repository\Repository.cs:Zeile 310.
       bei Nareg.Business.Service.GlnService.GetPersonsThatNeedANewGlnOrUpdate(Int32 takeCount) in c:\Daten\Source\gdk\Nareg\Nareg.Business\Service\GlnService.cs:Zeile 203.
       bei Nareg.Business.Service.GlnService.GetPersonsThatNeedANewGlnOrUpdate() in c:\Daten\Source\gdk\Nareg\Nareg.Business\Service\GlnService.cs:Zeile 106.
       bei Nareg.GlnService.Program.Main(String[] args) in c:\Daten\Source\gdk\Nareg\Nareg.GlnService\Program.cs:Zeile 116.
    Message: Initial check message was not received in 5000 milliseconds, which indicates a non-working multicast network. Parameters were: MulticastAddr: 224.1.1.1:666, Localpath: nareg-glnservice-acceptance1;nareg-glnservice-acceptance2

    Best Regards!

  7. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 25 Nov 2014 Link to this post

    Hello Daniel,
    Have you according to this link also specified the \HKLM\Software\Microsoft\MSMQ\Parameters\MulticastBindIP registry setting to tackle the correct IP address. Probably this is different from the ones you mentioned.

    Hope this helps.
    Feel free to ask in case you have any other question.


    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  8. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 25 Nov 2014 in reply to Ralph Waldenmaier Link to this post

    Hi Ralph

    Thanks for your answer. Indeed, I forget to set this entry. Now I have three registry entries pointing all on the same IP (BindInterfaceIP, ClusterBindIP and MulticastBindIP). I restarted the msmq service and now the binding seems to work:

    C:\Users\Administrator>netstat -ano | findstr 1801
      TCP    192.168.81.33:1801     0.0.0.0:0              ABHÖREN         2772

    The problem persists. I also (temporarely) deactivated the second network adapter in network and filesharing center, but this hadn't no effect. IPv6 is deactivated on the primary adapter - is this a problem?
    I would say the problems occurs every 2nd time (but with no reproducible sequence). Also the private message-queue is created every time (also if the exception occurs) - so this should be ok.

    Greetings
  9. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 26 Nov 2014 Link to this post

    Hi Daniel,
    The private queues will get generated if they do not exist.
    In order to avoid and  security issues here, would it be possible for you to create the needed queues ( 5 in your case ) manually, with the needed multicast settings and also the needed security settings that match your requirements, and see if you then still have these timeouts. By doing this you avoid the creation of the queues and any security issues.
    Since it is not consistently failing in your environment, it is a bit hard to exactly tell you what is happening.

    Hope this helps.

    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  10. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 01 Dec 2014 in reply to Ralph Waldenmaier Link to this post

    Hi Ralph
    Thanks for your answer. The security settings seem OK. I have now wireshark running and see the messages appearing on 224.1.1.1.
    The initial message is posted every time - even when the described exception comes. So the message queue seems to be working, the inital message is posted but not received within the 5 seconds. Please see the attached screenshot:
    MSMQ Screenshot

    The first 3 messages are posted as soon as the context-method "GetAll" is called. Then the timeout occurs and finally the exception handler disposes the database (4th message).
    So it seems that the listenener (thread?) isn't receiving the message although it has been posted. As I don't know how this initial-message check is implemented, I can't debug this further.
    I really rely on this mechanism, as I can't disable the cache at all in all applications. Any workarounds are appreciated...

    Greetings
  11. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 02 Dec 2014 in reply to Daniel Link to this post

    Hi Ralph

    Probably I have found a possible bug in your Init-Code in class MSMQClusterImpl of Telerik.OpenAccess.Runtime.
    Please see the comments of the decompiled source (sorry...) of your code:

    internal MSMQClusterImpl(OpenAccessClusterMsgHandler h, bool dbg, string multi, string loc, int ap, string humanName, Guid sysName, bool pc, AsynchronousInputHandler ah, int expirMS, string adminP, int initialMessageTimeout, IOpenAccessClusterTransportLog log)
        {
          this.useTracing = dbg;
          this.multicastaddr = multi;
          this.localPath = loc;
          this.handler = h;
          this.pubName = humanName;
          this.guid = sysName;
          this.app = ap;
          this.performancecounter = pc;
          this.asyncHandler = ah;
          this.adminPath = adminP;
          this.log = log;
          this.expirationTTRQ = new TimeSpan(0, 0, 0, expirMS / 1000, expirMS % 1000);
          this.expirationTTBR = this.expirationTTRQ + new TimeSpan(0, 2, 0);
          this.initialMessageTimeout = new TimeSpan(0, 0, 0, 0, initialMessageTimeout);
          this.initialWaitInstance = new object();
          try
          {
            this.Initialize(); // Runs the Thread "ReceiverRun" -> see below
            this.SendInternal((byte[]) null, this.squeue, 0, 0, (string) null, 7);
            lock (this.initialWaitInstance) // <- possible problem: If this code is called first,
                                   // the receiver-thread
                                   // cannot pulse the monitor as it is already locked
                                   // -> the Monitor.Wait
                                   // times out and this.closed remains false.
            {
              if (Monitor.Wait(this.initialWaitInstance, initialMessageTimeout))
                return;
              this.closed = true;
            }
          }
          catch
          {
            this.Dispose();
            throw;
          }
        }
         
        private void ReceiverRun()
        {
          TimeSpan ts = new TimeSpan(0, 0, 5);
          bool flag = false;
          byte[] defBuf = new byte[1024];
          AsynchronousInputHandler workOn = this.asyncHandler;
          try
          {
            while (!this.closed)
            {
              try
              {
                using (Message m = this.Receive(ref ts))
                {
                  if (m != null)
                  {
                    if (!flag)
                    {
                      lock (this.initialWaitInstance) // cannot get lock, if already locked by MSMQClusterImpl-ctor
                        Monitor.Pulse(this.initialWaitInstance);
                      flag = true;
                    }
                    if (!this.closed)
                    {
                      if (m.AppSpecific == this.app)
                        this.ProcessMessage(m, workOn, defBuf);
                    }
                  }
                }
              }
              catch (Exception ex)
              {
                if (!this.closed)
                  this.log.LogError(ex, "Cluster: Receive loop got exception.", new object[0]);
              }
            }
          }
          catch (Exception ex)
          {
            if (!this.closed)
              this.log.LogError(ex, "Cluster: Receive loop stopped.", new object[0]);
          }
          lock (this.receiverDone)
            Monitor.PulseAll(this.receiverDone);
        }

    What do you think? This would explain the inconsistent behaviour as sometimes one part of the code is faster (the thread runs with "AboveNormal" priority, so most of the time the Monitor.Pulse can be executed). Am I totally wrong?

    Best regards
  12. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 03 Dec 2014 Link to this post

    Hello Daniel,
    Thank you for getting back to us.
    The behavior of Monitor.Wait and Monitor.Pulse is a bit different. The lock for the 'initialWaitInstance' is released as soon as the Pulse or Wait method was called, to allow other threads to access the lock object again.

    Nevertheless, and this could be the situation in your case, it can be that the message is returned that fast, that the Wait is not yet called and thus, the Pulse call is triggered but not 'heard' by the Wait method. This could cause the problem in your environment. We have revisited the code again and are now doing additional checks to avoid such a situation. The fix will be part of the next release of Telerik Data Access. Unfortunately, I can not provide you with an eta for the next release.

    Additional to that, I have attached our CacheCluster tool. This allows you to use the same mechanisms as used within Telerik Data Access to check the messaging behavior. You can just start two or more instances and check if the messaging works as expected in your environment. Do you face the same problems here as well?

    Again, thank you for pointing us in that direction.
    Do come back in case you need further assistance.


    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  13. Daniel
    Daniel avatar
    9 posts
    Member since:
    Mar 2013

    Posted 03 Dec 2014 in reply to Ralph Waldenmaier Link to this post

    Thanks Ralph.

    I'm glad you found a possible cause for this issue. I misunderstood the behaviour of Monitor.Wait/Pulse (although is is quite basic stuff...).

    The CacheCluster tool seems to work well: The "Check MSMQ" is always green (tested 20 times). The pings are received by all three instances of the program. The receive-count stays 0 in all instances of CacheCluster. Tested on the developer machine and on the production server.

    I'm looking forward for the (possible) fix. I guess there will be a remark in the changelog.

    Thanks and best regards,
    Daniel

  14. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 05 Dec 2014 Link to this post

    Hi Daniel,
    Yes there will be a remark in the change log about the fix.
    Since the Cache Cluster tool does its job in your environment, it is most likely that there is some kind of a race condition in your live environment that is causing the observed behavior. The fix should definitely help here.

    I would like to thank you again for bringing this to our attention.
    Feel free to ask in case you have any other question..

    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
  15. Ralph Waldenmaier
    Admin
    Ralph Waldenmaier avatar
    202 posts

    Posted 15 Dec 2014 Link to this post

    Hi Daniel,
    I just wanted to update you that a new version of Telerik Data Access has been released that contains the previously mentioned fix. It hopefully fixes the issue in your environment as well.

    Do come back in case you have any other question.

    Regards,
    Ralph Waldenmaier
    Telerik
     
    OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
     
Back to Top
DevCraft banner