How to handle broadcast reports in a cluster environment

Bharath Kumar shared this question 4 months ago
Answered

I have seen the below community link to disable duplicate broadcasts:

https://community.yellowfinbi.com/knowledge-base/article/how-do-i-stop-duplicate-emails-being-sent-or-broadcasted-in-a-clustered-environment

I have the below scenario:

- We have 2 Yellowfin nodes in a cluster with a LB.

- If we disable the broadcasts on the secondary node, then broadcasts do not work when the LB routes a request to the secondary node.

- We get the error, "Task scheduler status is not available".

How can the secondary server enable the broadcasts itself when a request comes to it?
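For context, the knowledge-base article linked above works by disabling the task scheduler per node through a parameter in that node's web.xml. As a rough, hypothetical sketch only (the parameter name DisableTaskSchedule is the one quoted later in this thread; the exact servlet section, placement, and value should be verified against the article and your own web.xml), the fragment on the node that should NOT run schedules looks something like:

```xml
<!-- Hypothetical sketch: disables the task scheduler on this node only.
     Confirm the real parameter name and location against the KB article. -->
<init-param>
    <param-name>DisableTaskSchedule</param-name>
    <param-value>TRUE</param-value>
</init-param>
```

Because this is a static per-node setting, a node that has it set cannot start scheduling on its own when the LB routes traffic to it, which is the behaviour described in the question.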

Comments (21)


Hi Bharath,

if I have understood the scenario correctly as follows:

the LB initially has a user on the primary node who is using the Schedule Management utility, and then the LB moves him to the secondary node,

then the answer is that at this point in time the secondary server can't enable the broadcasts by itself when a request comes to it.

I think in this situation it would be advisable to get the administrator to log in directly to the primary node and not via the LB.


I do know that the Task Scheduler in a clustered environment is definitely going to be totally redesigned in the near future. The developers have lots of ideas for improving it; for example, if there are hundreds of scheduled tasks it would be much better to distribute them among all the nodes rather than trying to run them all on the main node.


With this in mind, I can raise an enhancement request to let the developers know of your clustering requirement for the Schedule Manager, but before I do, please clarify whether I have understood your scenario correctly.

thanks,

David


Hi Bharath,

just wondering how you got on with this matter, and whether you would like the ticket closed or not?

regards,

David


Hi Dave,

Apologies, I missed replying to you. We have reversed the task scheduler configuration, i.e. disabled it on the primary node and enabled it on the secondary node, and it has started working now. You may close the ticket. Thanks a lot for your support.

-Bharath


Hi Bharath,

great to hear it's all working now, thanks for letting me know!

regards,

David


Hi Dave,

You have correctly understood the situation. Just to add to my previous comment: yes, it would be great if this could be raised as an enhancement request. We have several customers with clustered environments, and every time we are confused by the task scheduler. Ideally, if the primary node is stopped or disabled for some reason, the task scheduler should enable itself on the secondary node; at least, one would expect this. Can you please explain your last sentence? I didn't get it:

+++++++++++++++++++++++++++++++++

I think in this situation it would be advisable to get the administrator to log in directly to the primary node and not via the LB.

+++++++++++++++

Regards,

Bharath


Hi Bharath,

don't worry, spreading the task scheduling across multiple nodes instead of having it enabled on just the primary node is definitely on the developers' roadmap.

Regarding my last sentence, all I meant was that if the administrator doesn't like the LB moving him to the secondary node, where the Task Scheduler is disabled, then he should access the primary node's URL directly (instead of the LB's URL) so that his session won't be moved.

regards,

David


Hi Dave,

Thanks for your response. Once again we are having an issue with broadcasts in the clustered setup. I can confirm that the DisableTaskSchedule parameter is set correctly in the file. Broadcasts were working fine, but we had a DB outage, and although the database was restored, broadcasts have not worked since. Is it possible that the DB outage disrupted the task scheduler?

Ad hoc broadcasts work fine, but scheduled ones do not trigger at all. As this is production, we have already had to restart the services on both nodes.

We tested with a report scheduled every minute, and this is working now; we assume that the remaining schedules will also work.


We "only" have a problem with a daily schedule on this machine, though (we tried rescheduling an existing broadcast, e.g. for half an hour later, but it does not send the report, and no error is seen in the logs). I have captured the info threads; let me know if that helps.

-Bharath


Hi Bharath,

thanks for the logs. I have investigated them and it looks to me like something is failing in the BMC AR Driver; I will explain my reasoning here:

in the log called "bmcsmartreporting-stdout.2018-06-14.log" you will see there are thousands of lines taken up with the parsing of an SQL query, and each time the parsing stops and ends with the following error message (the 1st one is on line 3796):

failed to parse 'null'in expected time format hh:mm:ss AM/PM
and yet when I searched through all of the Yellowfin project library there is no occurrence of that phrase 'in expected time format'.



it would be interesting to learn whether you can find that phrase in the AR Driver.


Please let me know what you think.


regards,

David


Hi Dave,

Thanks for your answer. I am very much interested in understanding the info threads. What utility do you use to work with them?

-Bharath


Every time there is an outage at the database end, we normally restart the YF Tomcat service and all the broadcasts resume.

But this time, we did not restart the Tomcat service and observed that all the broadcasts were stuck and did not trigger.

So, from what I understand, every time there is an outage at the DB, do we need to restart the Tomcat service, or refresh the broadcasts?


-Bharath


Hi Bharath,

please find attached my InfoThreadParser, it is very easy to use.

Just extract the zip file, then run the jar file called "InfoThreadParser_3".

Then to load all your info_thread.html files just go to File->Open and select them all at once.

Regarding the DB outage and YF's subsequent behaviour, I would like to test it out over here, so please tell me what sort of outage it was (e.g. a restart), and also which build of 7.1 you are observing this behaviour in.

thanks,

David


Hi Dave,

Thanks for sharing the info threads parser tool.

The database server was down and not reachable, and it came back up the next day. Since then, broadcasts never trigger.

We are on a 7.3 build.

-Bharath


Hi Bharath,

I have tried to reproduce this issue over here but so far have not been able to. I did two different tests. I set a broadcast to run every minute and then:

1) stopped the MSSQLSERVER service on the database server

2) shut down the database server

and in both tests when I restarted the MSSQLSERVER service/database server after 5 minutes of stoppage, the broadcasts started working correctly again and I was receiving a broadcast per minute.

When you say "The database server was down", what exactly was the state of it? I ask this because I'm wondering if this is crucial to the broadcast issue.

Also, I'm wondering whether the Volatile Data Sources feature might resolve your issue - I noticed in your logs that at the moment you've got it turned off, so could you please turn it on and see if it helps.

regards,

David


Hi Dave,


Thanks for your help so far. I was able to get more information on what exactly happened on the DB side.

We have an internal load balancer for SQL Server Always On. YF connects to the DB using the LB name and not the direct host name of the DB.


The load balancer in front of the DB was failing. The DB server was up and running but not reachable from the YF host due to the load balancer problem.


Temporarily, we fixed the issue on one of the YF hosts by modifying the hosts file on the Windows box so that it resolved the load balancer name to one of the DB server names.


The next day the load balancer issue was fixed, we restored the original configuration, and the DB servers were reachable again.


The YF service was never restarted, but ad hoc reports were working fine after the load balancer issue.


However, the broadcasting functionality was obviously broken, but this was only discovered later on.


Regarding the volatile datasource config:

What happens after the VOLATILERETRIES count is reached?

Does Yellowfin stop trying and need a restart before it checks again?


The database servers are in a high-availability setup and were reachable the whole time during the incident with the load balancers.


Only the connection between the DB and the YF server host was not working, for about a day.


On other occasions I have seen in YF that the data source was marked as "invalid" or "unavailable". I assume there is a mechanism which checks the data sources as well?


Could it be that a broadcast that should have run encountered an error due to the non-working data source, and that this caused scheduled broadcasts to stop working altogether?


Regards,

Bharath


Hi Bharath,

thanks for describing the nature of the outage to me. I would have thought that the scenario of YF not being able to contact the LB would have been covered by my 2nd test, in which I shut down my DB server, although I guess there could be subtle differences not known to me.

Regarding your question "What happens after the VOLATILERETRIES count is reached?": YF checks the state of the connection every 30 secs, so when the VOLATILERETRIES limit has been reached it waits until the next 30 sec period commences and then tests the connection all over again.

Regarding your question "Does Yellowfin stop trying and need a restart?": No, YF doesn't need a restart; it will check automatically every 30 seconds regardless of the state of the connection. Of course this means a little extra overhead on resources; however, if you want to guard against volatile source connections then this is a small price to pay.

One more thing: I assumed that your DB server hosts your data source; however, if in fact it hosts your YF configuration DB (or both!), then there is a similar feature, called JDBC Verify, that checks the connection to the YF DB, and you will need to turn this on instead of (or as well as) the Volatile Sources feature.

regards,

David


Thanks Dave, let me test with the volatile data sources configuration added in the database.

It was the source DB load balancer which was down.

Regards,

Bharath


OK, I'll await your update - hopefully it will be good news...


Q) What happens after the VOLATILERETRIES count is reached?


- YF checks the state of the connection every 30 secs. When the VOLATILERETRIES limit has been reached, it waits until the next 30 sec period commences and then tests the connection all over again.


You wrote:

This query sets the number of times Yellowfin will retry:

INSERT INTO Configuration (iporg, configtypecode, configcode, configdata) VALUES (1, 'SYSTEM', 'VOLATILERETRIES', '5');

This sounds like Yellowfin tries 5 times, waiting 30 seconds in between, but you said it tries indefinitely, so the setting seems obsolete?


-Bharath


Hi Bharath,


Q) What happens after the VOLATILERETRIES count is reached?

A) It won't check again for 30 seconds. For example (with VOLATILETIMEOUT=3000 and VOLATILERETRIES=3): when the next 30 second period comes around, YF checks the state of the connection. If it's OK, it won't check again until the next 30 sec period comes around. However, if the connection is no good, then YF will try to re-connect, wait the 3 seconds, and test it again; if it is OK then nothing more happens for 30 seconds, but if it is still no good then it will try to reconnect again, wait 3 more seconds, and test again. If the connection is OK this time then nothing more happens until the next 30 second period comes around; if it is still no good then YF will keep trying like this up to the VOLATILERETRIES limit, and then wait for the next 30 second period.
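The check cycle described above can be sketched as a small model. This is not Yellowfin's actual code; the function and parameter names here are purely illustrative, and it simply mirrors the behaviour David describes (one initial check per 30-second period, then up to VOLATILERETRIES reconnect attempts spaced VOLATILETIMEOUT apart):

```python
def run_check_cycle(connection_ok, reconnect, wait, retries=3, timeout_secs=3):
    """Model of one 30-second volatile-source check period.

    connection_ok: callable returning True if the data source responds.
    reconnect:     callable that attempts to re-establish the connection.
    wait:          callable used to pause (injected so tests need no real sleep).
    retries:       VOLATILERETRIES - reconnect attempts allowed per period.
    timeout_secs:  VOLATILETIMEOUT (milliseconds in the config) in seconds.
    Returns True if the connection was good by the end of the cycle.
    """
    if connection_ok():
        return True            # nothing more until the next 30-second period
    for _ in range(retries):
        reconnect()            # try to re-establish the connection
        wait(timeout_secs)     # wait VOLATILETIMEOUT before re-testing
        if connection_ok():
            return True        # good again - idle until the next period
    return False               # retries exhausted - try again next period
```

With VOLATILERETRIES=5 (as in the earlier SQL) the setting is therefore not obsolete: it bounds the reconnect attempts inside each 30-second period, while the 30-second cycle itself repeats indefinitely.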

I hope that makes sense, please let me know if you're still unclear about it.

regards,

David


Thanks a lot Dave for all your help


you're welcome Bharath!