Title: | NAS Message Queuing Bus |
Notice: | KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10 |
Moderator: | PAMSRC::MARCUS EN |
Created: | Wed Feb 27 1991 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 2898 |
Total number of notes: | 12363 |
Hi, A customer have a problem were messages sent between two groups via MRS disappears. This is seen on a VAX/VMS system with DMQ V3.2B. Background information: Late in Januari this year system time jumped one year into the future. (I know about ECO 3204). DMQ V3.0 was running at that time. Time was set back, and no problem was seen. Late in Februari power was lost on the system. After this the problems started (no problems with system time). DMQ was still at V3.0. From now on all MRS messages are lost from one specific group to another specific group. A trace show that messages are "stale" when they are received by the MRS server in the destination group. Last week I was onsite and upgraded DMQ to V3.2B. I did run DMQ$CONVERT_V30_TO_V32.COM and converted the groups. I was rather supprised when I noticed that the problem remains with V3.2B! I took a backup on the [DMQ$V30...] directory tree. Restored this on a internal system at the office. I installed V3.2B. Did run DMQ$CONVERT_V30_TO_V32. And I can now reproduce the problem with the DMQ test utility (after having changed node names in DMQ$INIT.TXT). Some details on the tests on my inhouse system: I enabled trace (DMQ$DEBUG=ALL) for the sending and receiving que. And MRS servers in both groups. Sending queue = 502.202 Receiving queue = 513.1 In the test utility I sent messages with DIP=WF_DQF and UMA=SAF PUT in the test utility returned following status --- PSB type 1 ---- Seq. number: 00CA01F6 PSB Status: 08549082 (%PAMS-E-STALE) 00000121 UMA Status: 08548353 (%PAMS-I-NO_UMA) Here is the trace from group 513 MRS_SERVER 4-APR-1997 19:26:10.39 I Entering PAMS_SCAN_UCBS, prio=1 MRS_SERVER 4-APR-1997 19:26:10.41 I Entering PAMS_SCAN_LLS MRS_SERVER 4-APR-1997 19:26:10.46 I =========================== PAMS Header V2.4 =================================== MRS_SERVER 4-APR-1997 19:26:10.46 I Target= 513.98 Source= 502.98 Alt-tgt= 0.0 Size-data= 12 MRS_SERVER 4-APR-1997 19:26:10.46 I Org-tgt= 513.1 Org-src= 502.202 Ntfy-tgt= 502.98 Owner-PID=20C08AC3 MRS_SERVER 4-APR-1997 19:26:10.46 I Class= 1 Type= -100 Alloc-flag= 1 Priority= 0 MRS_SERVER 4-APR-1997 19:26:10.46 I Del-Mode= 00000140 DIP-sts= 08548323 Visit-cnt= 15 Prev-grp= 502 MRS_SERVER 4-APR-1997 19:26:10.46 I Org-DM= 00000511 UMA-sts= 0 Timeout= 0 Endian= 0 MRS_SERVER 4-APR-1997 19:26:10.46 I Msg-flgs= 00000004 CRC[H:D]= 0000:0000 Seg[C:N]= 0:0 RP= 20 MRS_SERVER 4-APR-1997 19:26:10.46 I Seq-num= 00CA01F6 ROSN= 00000004 LRA= 02010001 Links=FFFC5E00 MRS_SERVER 4-APR-1997 19:26:10.46 I 00000121 34CD3780 006B0000 FFFC5E00 MRS_SERVER 4-APR-1997 19:26:10.50 I Canceling timer for context at 2fcc00 with id:40000103 MRS_SERVER 4-APR-1997 19:26:10.52 I Received msg: Source=502.98 class=1 type=-100 length=12 priority=0 MRS_SERVER 4-APR-1997 19:26:10.54 I Return from PAMS_COMMON_RCV MRS_SERVER 4-APR-1997 19:26:10.55 I [USER seq:(ca01f6,121), src:502.202, tgt:513.1, typ:-100, len:12, dm:140 ] MRS_SERVER 4-APR-1997 19:26:10.59 I [ msg: (ca01f6,121) send over 2.0 recovery protocol] MRS_SERVER 4-APR-1997 19:26:10.60 I [ get_last_rosn: src:(502.98), tgt:(513.1), rosn:(4,885873224) ] MRS_SERVER 4-APR-1997 19:26:10.61 I [ Stale Msg occured with (ca01f6,121). ] MRS_SERVER 4-APR-1997 19:26:10.62 I DQF success - notify sending journal process MRS_SERVER 4-APR-1997 19:26:10.63 I Entering PAMS_REALLOC_MSG MRS_SERVER 4-APR-1997 19:26:10.64 I PAMS_REALLOC_MSG params: *size:0, *msg_ptr:2ffc0000 MRS_SERVER 4-APR-1997 19:26:10.65 I [ SEND_IPI_MSG: Sending MSG_TYPE_IPI_CONFIRM (ca01f6,121) to 502.98 with status 8 This message disappeared. I can not GET the message from queue 513.1. Sending non-recoverable messages from 502.202 to 513.1 works fine. I have allowed R+E access for world to [DMQ$V32...] on my system. Trace files are saved as SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00502]DMQ$EVL_0001_00502.LOG_TRACE970404 (sending group) SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00513]DMQ$EVL_0001_00513.LOG_TRACE970404 All other files (DPA20:[DMQ$V32...]) are world readable. What is the problem here? Should I open a IPMT case on this problem? Regards, Sven-Olof Klasson, CSC Sweden Ps Note 2092.8 suggest there is a new problem (then the one fixed with ECO 3204)
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
2842.1 | PAMSIC::POPP | Deep in the Heart... | Tue Apr 08 1997 13:00 | 11 | |
This problem can be cleared by deleting group 513's submitter history file (dmq$mrs:DMQ$SH_0001_00513.0;1) and by deleting both group's (502 & 513) checkpoint files (DMQ$USER:DMQ$CHKPT_0001_ggggg.DAT). Groups will need to be shutdown before the files are deleted. In the mean time I will analyze the information you have provided. -Lisa | |||||
2842.2 | PAMSIC::POPP | Deep in the Heart... | Tue Apr 08 1997 14:00 | 106 | |
What has occured here is that the system was hosed under V3.0 and even though the system was upgrade to V3.2B, the files that were hosed under V3.0 are still in action. The fix provided in V3.2B is a fix that prevents the problem from ever occuring; however, it does not repair the bad data that made it into the submitter history files in V3.0. Group 502's checkpoint file contains a system time of... Tue Jan 27 01:25:27 1998 Group 513's checkpoint file contains a system time of... Tue Jan 27 01:25:24 1998 Group 513's submitter history file contains the following... from 5.98 - Tue Jan 27 03:46:00 1998 from 8.98 - Tue Jan 27 03:30:12 1998 from 502.98 - Tue Jan 27 03:53:44 1998 The SAF message that was rejected as STALE from group 502 has a ROSN time of Tue Jan 27 01:25:20 1998. What does all of this mean? Basically V3.2B is setting up the correct ROSN time based on the data supplied in the groups checkpoint file (this is good); however, the submitter history file for group 513 contains bad information. It considers any message that is younger than Tue Jan 27 03:53:44 1998 to be old or stale. The files were skewed under V3.0, the code in V3.2B prevents the skew from occuring. Please delete all submitter history files for groups 5, 8, 502, 513, and any other groups involved, AND go ahead and delete all of checkpoint files for these groups (groups must be stopped first). This should correct the problem and it should not reoccur under V3.2B. -Lisa Popp BEA/DEC MessageQ Engineering Some details on the tests on my inhouse system: I enabled trace (DMQ$DEBUG=ALL) for the sending and receiving que. And MRS servers in both groups. Sending queue = 502.202 Receiving queue = 513.1 In the test utility I sent messages with DIP=WF_DQF and UMA=SAF PUT in the test utility returned following status --- PSB type 1 ---- Seq. number: 00CA01F6 PSB Status: 08549082 (%PAMS-E-STALE) 00000121 UMA Status: 08548353 (%PAMS-I-NO_UMA) Here is the trace from group 513 MRS_SERVER 4-APR-1997 19:26:10.39 I Entering PAMS_SCAN_UCBS, prio=1 MRS_SERVER 4-APR-1997 19:26:10.41 I Entering PAMS_SCAN_LLS MRS_SERVER 4-APR-1997 19:26:10.46 I =========================== PAMS Header V2.4 =================================== MRS_SERVER 4-APR-1997 19:26:10.46 I Target= 513.98 Source= 502.98 Alt-tgt= 0.0 Size-data= 12 MRS_SERVER 4-APR-1997 19:26:10.46 I Org-tgt= 513.1 Org-src= 502.202 Ntfy-tgt= 502.98 Owner-PID=20C08AC3 MRS_SERVER 4-APR-1997 19:26:10.46 I Class= 1 Type= -100 Alloc-flag= 1 Priority= 0 MRS_SERVER 4-APR-1997 19:26:10.46 I Del-Mode= 00000140 DIP-sts= 08548323 Visit-cnt= 15 Prev-grp= 502 MRS_SERVER 4-APR-1997 19:26:10.46 I Org-DM= 00000511 UMA-sts= 0 Timeout= 0 Endian= 0 MRS_SERVER 4-APR-1997 19:26:10.46 I Msg-flgs= 00000004 CRC[H:D]= 0000:0000 Seg[C:N]= 0:0 RP= 20 MRS_SERVER 4-APR-1997 19:26:10.46 I Seq-num= 00CA01F6 ROSN= 00000004 LRA= 02010001 Links=FFFC5E00 MRS_SERVER 4-APR-1997 19:26:10.46 I 00000121 34CD3780 006B0000 FFFC5E00 MRS_SERVER 4-APR-1997 19:26:10.50 I Canceling timer for context at 2fcc00 with id:40000103 MRS_SERVER 4-APR-1997 19:26:10.52 I Received msg: Source=502.98 class=1 type=-100 length=12 priority=0 MRS_SERVER 4-APR-1997 19:26:10.54 I Return from PAMS_COMMON_RCV MRS_SERVER 4-APR-1997 19:26:10.55 I [USER seq:(ca01f6,121), src:502.202, tgt:513.1, typ:-100, len:12, dm:140 ] MRS_SERVER 4-APR-1997 19:26:10.59 I [ msg: (ca01f6,121) send over 2.0 recovery protocol] MRS_SERVER 4-APR-1997 19:26:10.60 I [ get_last_rosn: src:(502.98), tgt:(513.1), rosn:(4,885873224) ] MRS_SERVER 4-APR-1997 19:26:10.61 I [ Stale Msg occured with (ca01f6,121). ] MRS_SERVER 4-APR-1997 19:26:10.62 I DQF success - notify sending journal process MRS_SERVER 4-APR-1997 19:26:10.63 I Entering PAMS_REALLOC_MSG MRS_SERVER 4-APR-1997 19:26:10.64 I PAMS_REALLOC_MSG params: *size:0, *msg_ptr:2ffc0000 MRS_SERVER 4-APR-1997 19:26:10.65 I [ SEND_IPI_MSG: Sending MSG_TYPE_IPI_CONFIRM (ca01f6,121) to 502.98 with status 8 This message disappeared. I can not GET the message from queue 513.1. Sending non-recoverable messages from 502.202 to 513.1 works fine. I have allowed R+E access for world to [DMQ$V32...] on my system. Trace files are saved as SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00502]DMQ$EVL_0001_00502.LOG_TRACE970404 (sending group) SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00513]DMQ$EVL_0001_00513.LOG_TRACE970404 All other files (DPA20:[DMQ$V32...]) are world readable. What is the problem here? Should I open a IPMT case on this problem? Regards, Sven-Olof Klasson, CSC Sweden Ps Note 2092.8 suggest there is a new problem (then the one fixed with ECO 3204) | |||||
2842.3 | PAMSIC::POPP | Deep in the Heart... | Tue Apr 08 1997 14:28 | 29 | |
I forgot to mention that the SAF files will need to be drained (or deleted) as well. If the customer cannot afford to delete the SAF files then follow these steps: Stop all processes that are sending recoverable messages. Stop all receiving groups that have submitter history files. DIR DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.* Then delete the submitter history files (del DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.*;*). Restart the receiving groups, All of the SAF messages should be drained into the receiving groups DQF files. Once this is done go ahead and shutdown everybody. Once again delete all of the submitter history files (del DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.*;*). Now we will also delete all of the checkpoint files (del dmq$disk:[dmq$v32.user.*]DMQ$CHKPT_*.DAT;*) Now restart all groups and resume normal messaging activity. -Lisa |