This question is locked. New answers and comments are not allowed.
Hi there!
I have problems using the L2 cache of Open Access in a distributed environment. First I describe the environment then the problem.
Open Access Version: 2014.2.617.2
Participating applications:
A .NET 4.5 console application started by a scheduled task with the System-Account
A ASP.Net MVC application running under an IIS-User ("IIS AppPool\Nareg")
A WCF Webservice application running in IIS under a IIS-User ("IIS AppPool\NaregWebservice")
The console application is started once a day and running about 10min. The IIS-applications are running all day and are recycled after 24hours.
The L2-cache configuration is the same for all applications:
<
openAccessConfiguration
>
<
backendConfiguration
name
=
"SecondLevelCacheConfiguration"
>
<
secondLevelCache
enabled
=
"true"
numberOfObjects
=
"1000"
cacheQueryResults
=
"true"
numberOfQueryResults
=
"2000"
numberOfObjectsPerQueryResults
=
"550"
>
<
synchronization
enabled
=
"true"
name
=
"MSMQ"
localpath
=
"nareg1;nareg2;nareg3"
multicastAddress
=
"224.1.1.1:666"
app
=
"1"
expirationMSec
=
"3000"
statusDurationMSec
=
"4000"
initialMessageTimeoutMsec
=
"1000"
>
</
synchronization
>
</
secondLevelCache
>
</
backendConfiguration
>
</
openAccessConfiguration
>
The problem:
Most of the time the applications are running fine and the cache seems to be working. Changed entities in one application will be refetched from the database by the other applications.
But the console application sometimes has problems to startup. The stacktrace:
OpenAccessRuntime.MSMQClusterTransport.Telerik.OpenAccess.Cluster.OpenAccessClusterTransport.Init(OpenAccessClusterMsgHandler msgHandler, String _serverName, String ident, IOpenAccessClusterTransportLog log)
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.createStorageCache()
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.init()
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryImp.init()
bei OpenAccessRuntime.DataObjects.BootstrapPMF.getPersistenceManagerFactory(PropertySet props)
bei OpenAccessRuntime.Helper.getPersistenceManagerFactory(PropertySet props)
bei OpenAccessRuntime.DatabaseAdapter.AssertPersistenceManagerFactory(String usr, String password, Boolean open)
bei OpenAccessRuntime.DatabaseAdapter.GetObjectScope(TransactionProvider provider)
bei Telerik.OpenAccess.Database.GetObjectScope(TransactionProvider provider)
bei Telerik.OpenAccess.OpenAccessContextBase.GetScope()
bei Telerik.OpenAccess.OpenAccessContext.GetAllCore[T]()
bei Telerik.OpenAccess.OpenAccessContext.GetAll[T]()
bei Nareg.Data.OpenAccess.Repository.Repository`1.FindAll(Expression`1 predicate)
in
...
Message: Initial check message was not received
in
1000 milliseconds, which indicates a non-working multicast network. Parameters were: MulticastAddr: 224.1.1.1:666, Localpath: nareg1;nareg2;nareg3
So my question: Is my setup correct? I ask because I found a config parameter "AdministrationQueue" - should I setup this parameter?
What else could lead to this problem?
Thanks a lot!
13 Answers, 1 is accepted
0
Hello Daniel,
no, I think the administration queue will not help. The failure occurs, because the following happens:
In order to test the setup upon database open, we place an initial message in the message queue that is specified, and try to receive this message intra-process. If this message does not arrive in a certain amount
of time, we deem the setup broken. This is what happens on your side after one second.
I would try to increase the initialMessageTimeoutMsec to a value of 5000. The worst thing that can happen is that the attempt to open the database is taking much longer (up to 5 seconds) in the case that the setup is (for some unknown reason) invalid. But this is an error case anyways. Chances are, the message gets received after 1.2 seconds or so and all is fine.
Regards,
Thomas
Telerik
no, I think the administration queue will not help. The failure occurs, because the following happens:
In order to test the setup upon database open, we place an initial message in the message queue that is specified, and try to receive this message intra-process. If this message does not arrive in a certain amount
of time, we deem the setup broken. This is what happens on your side after one second.
I would try to increase the initialMessageTimeoutMsec to a value of 5000. The worst thing that can happen is that the attempt to open the database is taking much longer (up to 5 seconds) in the case that the setup is (for some unknown reason) invalid. But this is an error case anyways. Chances are, the message gets received after 1.2 seconds or so and all is fine.
Regards,
Thomas
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Daniel
Top achievements
Rank 1
answered on 24 Sep 2014, 10:15 AM
Hi Thomas
Thanks for your answer. I increased the timeout to 10000ms in every application. But today the error occurred again. Twice in a row - the third application start of the console application went ok.
What else could lead to this error? Do the applications share one queue? Does every application need full access on every queue? Is there some kind of fallback and we are running out of queues?
In short: It would be nice if you could elaborate a bit the mechanism of this caching.
Thank you and greetings!
Thanks for your answer. I increased the timeout to 10000ms in every application. But today the error occurred again. Twice in a row - the third application start of the console application went ok.
What else could lead to this error? Do the applications share one queue? Does every application need full access on every queue? Is there some kind of fallback and we are running out of queues?
In short: It would be nice if you could elaborate a bit the mechanism of this caching.
Thank you and greetings!
0
Hi Daniel,
You might find this blog post interesting where the behavior of the 2nd level cache in web applications is explained.
The recommendations based on the provided details is to have 5 queues. 2 for each process in IIS and one for the console application. You need 2 queues for a IIS application because of the overlapped recycling mechanism. Here the queues are not freed up before the new process is started and therefore you are getting troubles accessing the defined queues if you have only one queue available per IIS process.
Additionally please take care that you are disposing the database object in your Application_End method of your web application. The reason is, that otherwise the queue handles will not be freed until you restart your machine and thus will cause additional issues.
I hope this information is helpful for you.
Do come back in case you need further assistance.
Regards,
Ralph Waldenmaier
Telerik
You might find this blog post interesting where the behavior of the 2nd level cache in web applications is explained.
The recommendations based on the provided details is to have 5 queues. 2 for each process in IIS and one for the console application. You need 2 queues for a IIS application because of the overlapped recycling mechanism. Here the queues are not freed up before the new process is started and therefore you are getting troubles accessing the defined queues if you have only one queue available per IIS process.
Additionally please take care that you are disposing the database object in your Application_End method of your web application. The reason is, that otherwise the queue handles will not be freed until you restart your machine and thus will cause additional issues.
I hope this information is helpful for you.
Do come back in case you need further assistance.
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Daniel
Top achievements
Rank 1
answered on 25 Nov 2014, 08:28 AM
Hi Ralph
Thanks for your answer and suggestions. Now my setup is as you described. The console application has its own queue and closes the queue also upon exit/exception with database-dispose.
Although I still having problems. The exception still occurs. I also have set the BindInterfaceIP and ClusterBindIP IP in the registry, as the server has two (virtual) network interfaces (vmxnet3-Adapters).
Do I have a chance to trace the problem down to its cause? It's only the startup behaviour - once the processes are running, the 2nd Level cache works as expected.
I'm stuck right now, do you have any suggestions?
The exception:
Best Regards!
Thanks for your answer and suggestions. Now my setup is as you described. The console application has its own queue and closes the queue also upon exit/exception with database-dispose.
Although I still having problems. The exception still occurs. I also have set the BindInterfaceIP and ClusterBindIP IP in the registry, as the server has two (virtual) network interfaces (vmxnet3-Adapters).
Do I have a chance to trace the problem down to its cause? It's only the startup behaviour - once the processes are running, the 2nd Level cache works as expected.
I'm stuck right now, do you have any suggestions?
The exception:
Stacktrace: bei OpenAccessRuntime.MSMQClusterTransport.Telerik.OpenAccess.Cluster.OpenAccessClusterTransport.Init(OpenAccessClusterMsgHandler msgHandler, String _serverName, String ident, IOpenAccessClusterTransportLog log)
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.createStorageCache()
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryBase.init()
bei OpenAccessRuntime.DataObjects.PersistenceManagerFactoryImp.init()
bei OpenAccessRuntime.DataObjects.BootstrapPMF.getPersistenceManagerFactory(PropertySet props)
bei OpenAccessRuntime.Helper.getPersistenceManagerFactory(PropertySet props)
bei OpenAccessRuntime.DatabaseAdapter.AssertPersistenceManagerFactory(String usr, String password, Boolean open)
bei OpenAccessRuntime.DatabaseAdapter.GetObjectScope(TransactionProvider provider)
bei Telerik.OpenAccess.Database.GetObjectScope(TransactionProvider provider)
bei Telerik.OpenAccess.OpenAccessContextBase.GetScope()
bei Telerik.OpenAccess.OpenAccessContext.GetAllCore[T]()
bei Telerik.OpenAccess.OpenAccessContext.GetAll[T]()
bei Nareg.Data.OpenAccess.Repository.Repository`1.FindAll(Expression`1 predicate)
in
c:\Daten\Source\gdk\Nareg\Nareg.Data.OpenAccess\Repository\Repository.cs:Zeile 310.
bei Nareg.Business.Service.GlnService.GetPersonsThatNeedANewGlnOrUpdate(Int32 takeCount)
in
c:\Daten\Source\gdk\Nareg\Nareg.Business\Service\GlnService.cs:Zeile 203.
bei Nareg.Business.Service.GlnService.GetPersonsThatNeedANewGlnOrUpdate()
in
c:\Daten\Source\gdk\Nareg\Nareg.Business\Service\GlnService.cs:Zeile 106.
bei Nareg.GlnService.Program.Main(String[] args)
in
c:\Daten\Source\gdk\Nareg\Nareg.GlnService\Program.cs:Zeile 116.
Message: Initial check message was not received
in
5000 milliseconds, which indicates a non-working multicast network. Parameters were: MulticastAddr: 224.1.1.1:666, Localpath: nareg-glnservice-acceptance1;nareg-glnservice-acceptance2
Best Regards!
0
Hello Daniel,
Have you according to this link also specified the \HKLM\Software\Microsoft\MSMQ\Parameters\MulticastBindIP registry setting to tackle the correct IP address. Probably this is different from the ones you mentioned.
Hope this helps.
Feel free to ask in case you have any other question.
Regards,
Ralph Waldenmaier
Telerik
Have you according to this link also specified the \HKLM\Software\Microsoft\MSMQ\Parameters\MulticastBindIP registry setting to tackle the correct IP address. Probably this is different from the ones you mentioned.
Hope this helps.
Feel free to ask in case you have any other question.
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Daniel
Top achievements
Rank 1
answered on 25 Nov 2014, 01:22 PM
Hi Ralph
Thanks for your answer. Indeed, I forget to set this entry. Now I have three registry entries pointing all on the same IP (BindInterfaceIP, ClusterBindIP and MulticastBindIP). I restarted the msmq service and now the binding seems to work:
The problem persists. I also (temporarely) deactivated the second network adapter in network and filesharing center, but this hadn't no effect. IPv6 is deactivated on the primary adapter - is this a problem?
I would say the problems occurs every 2nd time (but with no reproducible sequence). Also the private message-queue is created every time (also if the exception occurs) - so this should be ok.
Greetings
Thanks for your answer. Indeed, I forget to set this entry. Now I have three registry entries pointing all on the same IP (BindInterfaceIP, ClusterBindIP and MulticastBindIP). I restarted the msmq service and now the binding seems to work:
C:\Users\Administrator>netstat -ano | findstr 1801
TCP 192.168.81.33:1801 0.0.0.0:0 ABHÖREN 2772
The problem persists. I also (temporarely) deactivated the second network adapter in network and filesharing center, but this hadn't no effect. IPv6 is deactivated on the primary adapter - is this a problem?
I would say the problems occurs every 2nd time (but with no reproducible sequence). Also the private message-queue is created every time (also if the exception occurs) - so this should be ok.
Greetings
0
Hi Daniel,
The private queues will get generated if they do not exist.
In order to avoid and security issues here, would it be possible for you to create the needed queues ( 5 in your case ) manually, with the needed multicast settings and also the needed security settings that match your requirements, and see if you then still have these timeouts. By doing this you avoid the creation of the queues and any security issues.
Since it is not consistently failing in your environment, it is a bit hard to exactly tell you what is happening.
Hope this helps.
Regards,
Ralph Waldenmaier
Telerik
The private queues will get generated if they do not exist.
In order to avoid and security issues here, would it be possible for you to create the needed queues ( 5 in your case ) manually, with the needed multicast settings and also the needed security settings that match your requirements, and see if you then still have these timeouts. By doing this you avoid the creation of the queues and any security issues.
Since it is not consistently failing in your environment, it is a bit hard to exactly tell you what is happening.
Hope this helps.
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Daniel
Top achievements
Rank 1
answered on 01 Dec 2014, 07:52 PM
Hi Ralph
Thanks for your answer. The security settings seem OK. I have now wireshark running and see the messages appearing on 224.1.1.1.
The initial message is posted every time - even when the described exception comes. So the message queue seems to be working, the inital message is posted but not received within the 5 seconds. Please see the attached screenshot:
MSMQ Screenshot
The first 3 messages are posted as soon as the context-method "GetAll" is called. Then the timeout occurs and finally the exception handler disposes the database (4th message).
So it seems that the listenener (thread?) isn't receiving the message although it has been posted. As I don't know how this initial-message check is implemented, I can't debug this further.
I really rely on this mechanism, as I can't disable the cache at all in all applications. Any workarounds are appreciated...
Greetings
Thanks for your answer. The security settings seem OK. I have now wireshark running and see the messages appearing on 224.1.1.1.
The initial message is posted every time - even when the described exception comes. So the message queue seems to be working, the inital message is posted but not received within the 5 seconds. Please see the attached screenshot:
MSMQ Screenshot
The first 3 messages are posted as soon as the context-method "GetAll" is called. Then the timeout occurs and finally the exception handler disposes the database (4th message).
So it seems that the listenener (thread?) isn't receiving the message although it has been posted. As I don't know how this initial-message check is implemented, I can't debug this further.
I really rely on this mechanism, as I can't disable the cache at all in all applications. Any workarounds are appreciated...
Greetings
0
Daniel
Top achievements
Rank 1
answered on 02 Dec 2014, 02:56 PM
Hi Ralph
Probably I have found a possible bug in your Init-Code in class MSMQClusterImpl of Telerik.OpenAccess.Runtime.
Please see the comments of the decompiled source (sorry...) of your code:
What do you think? This would explain the inconsistent behaviour as sometimes one part of the code is faster (the thread runs with "AboveNormal" priority, so most of the time the Monitor.Pulse can be executed). Am I totally wrong?
Best regards
Probably I have found a possible bug in your Init-Code in class MSMQClusterImpl of Telerik.OpenAccess.Runtime.
Please see the comments of the decompiled source (sorry...) of your code:
internal
MSMQClusterImpl(OpenAccessClusterMsgHandler h,
bool
dbg,
string
multi,
string
loc,
int
ap,
string
humanName, Guid sysName,
bool
pc, AsynchronousInputHandler ah,
int
expirMS,
string
adminP,
int
initialMessageTimeout, IOpenAccessClusterTransportLog log)
{
this
.useTracing = dbg;
this
.multicastaddr = multi;
this
.localPath = loc;
this
.handler = h;
this
.pubName = humanName;
this
.guid = sysName;
this
.app = ap;
this
.performancecounter = pc;
this
.asyncHandler = ah;
this
.adminPath = adminP;
this
.log = log;
this
.expirationTTRQ =
new
TimeSpan(0, 0, 0, expirMS / 1000, expirMS % 1000);
this
.expirationTTBR =
this
.expirationTTRQ +
new
TimeSpan(0, 2, 0);
this
.initialMessageTimeout =
new
TimeSpan(0, 0, 0, 0, initialMessageTimeout);
this
.initialWaitInstance =
new
object
();
try
{
this
.Initialize();
// Runs the Thread "ReceiverRun" -> see below
this
.SendInternal((
byte
[])
null
,
this
.squeue, 0, 0, (
string
)
null
, 7);
lock
(
this
.initialWaitInstance)
// <- possible problem: If this code is called first,
// the receiver-thread
// cannot pulse the monitor as it is already locked
// -> the Monitor.Wait
// times out and this.closed remains false.
{
if
(Monitor.Wait(
this
.initialWaitInstance, initialMessageTimeout))
return
;
this
.closed =
true
;
}
}
catch
{
this
.Dispose();
throw
;
}
}
private
void
ReceiverRun()
{
TimeSpan ts =
new
TimeSpan(0, 0, 5);
bool
flag =
false
;
byte
[] defBuf =
new
byte
[1024];
AsynchronousInputHandler workOn =
this
.asyncHandler;
try
{
while
(!
this
.closed)
{
try
{
using
(Message m =
this
.Receive(
ref
ts))
{
if
(m !=
null
)
{
if
(!flag)
{
lock
(
this
.initialWaitInstance)
// cannot get lock, if already locked by MSMQClusterImpl-ctor
Monitor.Pulse(
this
.initialWaitInstance);
flag =
true
;
}
if
(!
this
.closed)
{
if
(m.AppSpecific ==
this
.app)
this
.ProcessMessage(m, workOn, defBuf);
}
}
}
}
catch
(Exception ex)
{
if
(!
this
.closed)
this
.log.LogError(ex,
"Cluster: Receive loop got exception."
,
new
object
[0]);
}
}
}
catch
(Exception ex)
{
if
(!
this
.closed)
this
.log.LogError(ex,
"Cluster: Receive loop stopped."
,
new
object
[0]);
}
lock
(
this
.receiverDone)
Monitor.PulseAll(
this
.receiverDone);
}
What do you think? This would explain the inconsistent behaviour as sometimes one part of the code is faster (the thread runs with "AboveNormal" priority, so most of the time the Monitor.Pulse can be executed). Am I totally wrong?
Best regards
0
Hello Daniel,
Thank you for getting back to us.
The behavior of Monitor.Wait and Monitor.Pulse is a bit different. The lock for the 'initialWaitInstance' is released as soon as the Pulse or Wait method was called, to allow other threads to access the lock object again.
Nevertheless, and this could be the situation in your case, it can be that the message is returned that fast, that the Wait is not yet called and thus, the Pulse call is triggered but not 'heard' by the Wait method. This could cause the problem in your environment. We have revisited the code again and are now doing additional checks to avoid such a situation. The fix will be part of the next release of Telerik Data Access. Unfortunately, I can not provide you with an eta for the next release.
Additional to that, I have attached our CacheCluster tool. This allows you to use the same mechanisms as used within Telerik Data Access to check the messaging behavior. You can just start two or more instances and check if the messaging works as expected in your environment. Do you face the same problems here as well?
Again, thank you for pointing us in that direction.
Do come back in case you need further assistance.
Regards,
Ralph Waldenmaier
Telerik
Thank you for getting back to us.
The behavior of Monitor.Wait and Monitor.Pulse is a bit different. The lock for the 'initialWaitInstance' is released as soon as the Pulse or Wait method was called, to allow other threads to access the lock object again.
Nevertheless, and this could be the situation in your case, it can be that the message is returned that fast, that the Wait is not yet called and thus, the Pulse call is triggered but not 'heard' by the Wait method. This could cause the problem in your environment. We have revisited the code again and are now doing additional checks to avoid such a situation. The fix will be part of the next release of Telerik Data Access. Unfortunately, I can not provide you with an eta for the next release.
Additional to that, I have attached our CacheCluster tool. This allows you to use the same mechanisms as used within Telerik Data Access to check the messaging behavior. You can just start two or more instances and check if the messaging works as expected in your environment. Do you face the same problems here as well?
Again, thank you for pointing us in that direction.
Do come back in case you need further assistance.
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Daniel
Top achievements
Rank 1
answered on 03 Dec 2014, 03:42 PM
Thanks Ralph.
I'm glad you found a possible cause for this issue. I misunderstood the behaviour of Monitor.Wait/Pulse (although is is quite basic stuff...).
The CacheCluster tool seems to work well: The "Check MSMQ" is always green (tested 20 times). The pings are received by all three instances of the program. The receive-count stays 0 in all instances of CacheCluster. Tested on the developer machine and on the production server.
I'm looking forward for the (possible) fix. I guess there will be a remark in the changelog.
Thanks and best regards,
Daniel
I'm glad you found a possible cause for this issue. I misunderstood the behaviour of Monitor.Wait/Pulse (although is is quite basic stuff...).
The CacheCluster tool seems to work well: The "Check MSMQ" is always green (tested 20 times). The pings are received by all three instances of the program. The receive-count stays 0 in all instances of CacheCluster. Tested on the developer machine and on the production server.
I'm looking forward for the (possible) fix. I guess there will be a remark in the changelog.
Thanks and best regards,
Daniel
0
Hi Daniel,
Yes there will be a remark in the change log about the fix.
Since the Cache Cluster tool does its job in your environment, it is most likely that there is some kind of a race condition in your live environment that is causing the observed behavior. The fix should definitely help here.
I would like to thank you again for bringing this to our attention.
Feel free to ask in case you have any other question..
Regards,
Ralph Waldenmaier
Telerik
Yes there will be a remark in the change log about the fix.
Since the Cache Cluster tool does its job in your environment, it is most likely that there is some kind of a race condition in your live environment that is causing the observed behavior. The fix should definitely help here.
I would like to thank you again for bringing this to our attention.
Feel free to ask in case you have any other question..
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.
0
Hi Daniel,
I just wanted to update you that a new version of Telerik Data Access has been released that contains the previously mentioned fix. It hopefully fixes the issue in your environment as well.
Do come back in case you have any other question.
Regards,
Ralph Waldenmaier
Telerik
I just wanted to update you that a new version of Telerik Data Access has been released that contains the previously mentioned fix. It hopefully fixes the issue in your environment as well.
Do come back in case you have any other question.
Regards,
Ralph Waldenmaier
Telerik
OpenAccess ORM is now Telerik Data Access. For more information on the new names, please, check out the Telerik Product Map.