[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference pamsrc::decmessageq

Title:NAS Message Queuing Bus
Notice:KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator:PAMSRC::MARCUSEN
Created:Wed Feb 27 1991
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2898
Total number of notes:12363

2842.0. "Messages lost, PAMS-E-STALE (DMQ V3.2B)" by RULLE::KLASSON (Sven-Olof Klasson @GOO) Tue Apr 08 1997 10:11

Hi,

A customer have a problem were messages sent between two groups via MRS
disappears. This is seen on a VAX/VMS system with DMQ V3.2B.

Background information:
Late in Januari this year system time jumped one year into the future.
(I know about ECO 3204). DMQ V3.0 was running at that time.
Time was set back, and no problem was seen.

Late in Februari power was lost on the system. After this the problems
started (no problems with system time). DMQ was still at V3.0.
From now on all MRS messages are lost from one specific group to another
specific group. A trace show that messages are "stale" when they are received
by the MRS server in the destination group.

Last week I was onsite and upgraded DMQ to V3.2B. I did run 
DMQ$CONVERT_V30_TO_V32.COM and converted the groups. I was rather supprised 
when I noticed that the problem remains with V3.2B!
I took a backup on the [DMQ$V30...] directory tree. Restored this on a internal
system at the office. I installed V3.2B. Did run DMQ$CONVERT_V30_TO_V32. And
I can now reproduce the problem with the DMQ test utility (after having changed
node names in DMQ$INIT.TXT).

Some details on the tests on my inhouse system:
I enabled trace (DMQ$DEBUG=ALL) for the sending and receiving que. And MRS
servers in both groups.
Sending queue = 502.202
Receiving queue = 513.1
In the test utility I sent messages with DIP=WF_DQF and UMA=SAF

PUT in the test utility returned following status
                          --- PSB type 1 ----
 Seq. number: 00CA01F6   PSB Status: 08549082 (%PAMS-E-STALE)
              00000121   UMA Status: 08548353 (%PAMS-I-NO_UMA)

Here is the trace from group 513
MRS_SERVER       4-APR-1997 19:26:10.39 I Entering PAMS_SCAN_UCBS, prio=1
MRS_SERVER       4-APR-1997 19:26:10.41 I Entering PAMS_SCAN_LLS
MRS_SERVER       4-APR-1997 19:26:10.46 I =========================== PAMS Header V2.4 ===================================
MRS_SERVER       4-APR-1997 19:26:10.46 I   Target=    513.98   Source=    502.98    Alt-tgt=       0.0 Size-data=      12
MRS_SERVER       4-APR-1997 19:26:10.46 I  Org-tgt=     513.1  Org-src=   502.202   Ntfy-tgt=    502.98 Owner-PID=20C08AC3
MRS_SERVER       4-APR-1997 19:26:10.46 I    Class=         1     Type=      -100 Alloc-flag=         1  Priority=       0
MRS_SERVER       4-APR-1997 19:26:10.46 I Del-Mode=  00000140  DIP-sts=  08548323  Visit-cnt=        15  Prev-grp=     502
MRS_SERVER       4-APR-1997 19:26:10.46 I   Org-DM=  00000511  UMA-sts=         0    Timeout=         0    Endian=       0
MRS_SERVER       4-APR-1997 19:26:10.46 I Msg-flgs=  00000004 CRC[H:D]= 0000:0000   Seg[C:N]=       0:0        RP=      20
MRS_SERVER       4-APR-1997 19:26:10.46 I  Seq-num=  00CA01F6     ROSN=  00000004        LRA=  02010001     Links=FFFC5E00
MRS_SERVER       4-APR-1997 19:26:10.46 I            00000121            34CD3780              006B0000           FFFC5E00
MRS_SERVER       4-APR-1997 19:26:10.50 I Canceling timer for context at 2fcc00 with id:40000103
MRS_SERVER       4-APR-1997 19:26:10.52 I Received msg: Source=502.98 class=1 type=-100 length=12 priority=0
MRS_SERVER       4-APR-1997 19:26:10.54 I Return from PAMS_COMMON_RCV
MRS_SERVER       4-APR-1997 19:26:10.55 I [USER seq:(ca01f6,121), src:502.202, tgt:513.1, typ:-100, len:12, dm:140 ]
MRS_SERVER       4-APR-1997 19:26:10.59 I [ msg: (ca01f6,121) send over 2.0 recovery protocol]
MRS_SERVER       4-APR-1997 19:26:10.60 I [   get_last_rosn: src:(502.98), tgt:(513.1), rosn:(4,885873224) ]
MRS_SERVER       4-APR-1997 19:26:10.61 I [   Stale Msg occured with (ca01f6,121). ]
MRS_SERVER       4-APR-1997 19:26:10.62 I DQF success - notify sending journal process
MRS_SERVER       4-APR-1997 19:26:10.63 I Entering PAMS_REALLOC_MSG
MRS_SERVER       4-APR-1997 19:26:10.64 I PAMS_REALLOC_MSG params: *size:0, *msg_ptr:2ffc0000
MRS_SERVER       4-APR-1997 19:26:10.65 I [   SEND_IPI_MSG: Sending MSG_TYPE_IPI_CONFIRM       (ca01f6,121) to 502.98 with status 8

This message disappeared. I can not GET the message from queue 513.1.
Sending non-recoverable messages from 502.202 to 513.1 works fine.

I have allowed R+E access for world to [DMQ$V32...] on my system.
Trace files are saved as
SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00502]DMQ$EVL_0001_00502.LOG_TRACE970404 (sending group)
SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00513]DMQ$EVL_0001_00513.LOG_TRACE970404
All other files (DPA20:[DMQ$V32...]) are world readable.

What is the problem here?
Should I open a IPMT case on this problem?

Regards,
Sven-Olof Klasson, CSC Sweden

Ps Note 2092.8 suggest there is a new problem (then the one fixed with ECO 3204)
T.RTitleUserPersonal
Name
DateLines
2842.1PAMSIC::POPPDeep in the Heart...Tue Apr 08 1997 13:0011

This problem can be cleared by deleting group 513's submitter history
file (dmq$mrs:DMQ$SH_0001_00513.0;1) and by deleting both group's (502 & 513)
checkpoint files (DMQ$USER:DMQ$CHKPT_0001_ggggg.DAT).

Groups will need to be shutdown before the files are deleted.
In the mean time I will analyze the information you have provided.

-Lisa

2842.2PAMSIC::POPPDeep in the Heart...Tue Apr 08 1997 14:00106

 What has occured here is that the system was hosed under V3.0 and even
though the system was upgrade to V3.2B, the files that were hosed under
V3.0 are still in action.  The fix provided in V3.2B is a fix that prevents
the problem from ever occuring; however, it does not repair the bad data
that made it into the submitter history files in V3.0.



Group 502's checkpoint file contains a system time of...

Tue Jan 27 01:25:27 1998


Group 513's checkpoint file contains a system time of...

Tue Jan 27 01:25:24 1998


Group 513's submitter history file contains the following...

from   5.98 - Tue Jan 27 03:46:00 1998
from   8.98 - Tue Jan 27 03:30:12 1998
from 502.98 - Tue Jan 27 03:53:44 1998


The SAF message that was rejected as STALE from group 502 has a ROSN time of 
Tue Jan 27 01:25:20 1998.


What does all of this mean?


Basically V3.2B is setting up the correct ROSN time based on the data
supplied in the groups checkpoint file (this is good); however, the
submitter history file for group 513 contains bad information.  It
considers any message that is younger than Tue Jan 27 03:53:44 1998
to be old or stale.  The files were skewed under V3.0, the code in V3.2B
prevents the skew from occuring.  Please delete all submitter history
files for groups 5, 8, 502, 513, and any other groups involved, AND
go ahead and delete all of checkpoint files for these groups (groups must
be stopped first).  

This should correct the problem and it should not reoccur under V3.2B.

-Lisa Popp
BEA/DEC MessageQ Engineering





Some details on the tests on my inhouse system:
I enabled trace (DMQ$DEBUG=ALL) for the sending and receiving que. And MRS
servers in both groups.
Sending queue = 502.202
Receiving queue = 513.1
In the test utility I sent messages with DIP=WF_DQF and UMA=SAF

PUT in the test utility returned following status
                          --- PSB type 1 ----
 Seq. number: 00CA01F6   PSB Status: 08549082 (%PAMS-E-STALE)
              00000121   UMA Status: 08548353 (%PAMS-I-NO_UMA)

Here is the trace from group 513
MRS_SERVER       4-APR-1997 19:26:10.39 I Entering PAMS_SCAN_UCBS, prio=1
MRS_SERVER       4-APR-1997 19:26:10.41 I Entering PAMS_SCAN_LLS
MRS_SERVER       4-APR-1997 19:26:10.46 I =========================== PAMS Header V2.4 ===================================
MRS_SERVER       4-APR-1997 19:26:10.46 I   Target=    513.98   Source=    502.98    Alt-tgt=       0.0 Size-data=      12
MRS_SERVER       4-APR-1997 19:26:10.46 I  Org-tgt=     513.1  Org-src=   502.202   Ntfy-tgt=    502.98 Owner-PID=20C08AC3
MRS_SERVER       4-APR-1997 19:26:10.46 I    Class=         1     Type=      -100 Alloc-flag=         1  Priority=       0
MRS_SERVER       4-APR-1997 19:26:10.46 I Del-Mode=  00000140  DIP-sts=  08548323  Visit-cnt=        15  Prev-grp=     502
MRS_SERVER       4-APR-1997 19:26:10.46 I   Org-DM=  00000511  UMA-sts=         0    Timeout=         0    Endian=       0
MRS_SERVER       4-APR-1997 19:26:10.46 I Msg-flgs=  00000004 CRC[H:D]= 0000:0000   Seg[C:N]=       0:0        RP=      20
MRS_SERVER       4-APR-1997 19:26:10.46 I  Seq-num=  00CA01F6     ROSN=  00000004        LRA=  02010001     Links=FFFC5E00
MRS_SERVER       4-APR-1997 19:26:10.46 I            00000121            34CD3780              006B0000           FFFC5E00
MRS_SERVER       4-APR-1997 19:26:10.50 I Canceling timer for context at 2fcc00 with id:40000103
MRS_SERVER       4-APR-1997 19:26:10.52 I Received msg: Source=502.98 class=1 type=-100 length=12 priority=0
MRS_SERVER       4-APR-1997 19:26:10.54 I Return from PAMS_COMMON_RCV
MRS_SERVER       4-APR-1997 19:26:10.55 I [USER seq:(ca01f6,121), src:502.202, tgt:513.1, typ:-100, len:12, dm:140 ]
MRS_SERVER       4-APR-1997 19:26:10.59 I [ msg: (ca01f6,121) send over 2.0 recovery protocol]
MRS_SERVER       4-APR-1997 19:26:10.60 I [   get_last_rosn: src:(502.98), tgt:(513.1), rosn:(4,885873224) ]
MRS_SERVER       4-APR-1997 19:26:10.61 I [   Stale Msg occured with (ca01f6,121). ]
MRS_SERVER       4-APR-1997 19:26:10.62 I DQF success - notify sending journal process
MRS_SERVER       4-APR-1997 19:26:10.63 I Entering PAMS_REALLOC_MSG
MRS_SERVER       4-APR-1997 19:26:10.64 I PAMS_REALLOC_MSG params: *size:0, *msg_ptr:2ffc0000
MRS_SERVER       4-APR-1997 19:26:10.65 I [   SEND_IPI_MSG: Sending MSG_TYPE_IPI_CONFIRM       (ca01f6,121) to 502.98 with status 8

This message disappeared. I can not GET the message from queue 513.1.
Sending non-recoverable messages from 502.202 to 513.1 works fine.

I have allowed R+E access for world to [DMQ$V32...] on my system.
Trace files are saved as
SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00502]DMQ$EVL_0001_00502.LOG_TRACE970404 (sending group)
SPRUTT""::DPA20:[DMQ$V32.LOG.0001_00513]DMQ$EVL_0001_00513.LOG_TRACE970404
All other files (DPA20:[DMQ$V32...]) are world readable.

What is the problem here?
Should I open a IPMT case on this problem?

Regards,
Sven-Olof Klasson, CSC Sweden

Ps Note 2092.8 suggest there is a new problem (then the one fixed with ECO 3204)

2842.3PAMSIC::POPPDeep in the Heart...Tue Apr 08 1997 14:2829

I forgot to mention that the SAF files will need to be drained (or deleted)
as well.

If the customer cannot afford to delete the SAF files then follow
these steps:

Stop all processes that are sending recoverable messages.

Stop all receiving groups that have submitter history files.
DIR DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.*

Then delete the submitter history files 
(del DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.*;*).

Restart the receiving groups, All of the SAF messages should be
drained into the receiving groups DQF files.  Once this is done
go ahead and shutdown everybody.

Once again delete all of the submitter history files
(del DMQ$DISK:[DMQ$V32.MRS.*]DMQ$SH*.*;*).
Now we will also delete all of the checkpoint files
(del dmq$disk:[dmq$v32.user.*]DMQ$CHKPT_*.DAT;*)

Now restart all groups and resume normal messaging activity.

-Lisa