
Conference turris::digital_unix

Title:	DIGITAL UNIX(FORMERLY KNOWN AS DEC OSF/1)
Notice:	Welcome to the Digital UNIX Conference
Moderator:	SMURF::DENHAM

Created:	Thu Mar 16 1995
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	10068
Total number of notes:	35879

8778.0. "Getting ctape_strat messages intermittently" by RHETT::HEBERT () Mon Feb 10 1997 15:56

    This is information from my customer about the tape errors and
    conditions he is getting.  Thanks in advance for any help!
    
    Armand Hebert
    CSC - Atlanta
    
    ___________________________________________________________________
    
    > 1. what is the firmware of the tape drive (from uerf)                         
                                                                                    
    from scu mt0 is 9B3C and mt1 is 930A;                                           
    we don't know if one or both were contributing to the problem.                  
    note that both are tz877's and using nsr jukebox capability.                    
                                                                                    
    > 2. how often are these happening now?                                         
                                                                                    
    the dramatic loops have occurred twice, once last July on a 2100 3.2d-1
    and Sunday on an 8400 v3.2g.                                                    
                                                                                    
    two ctape_strat messages were showing with every reboot of the 2100             
    last summer, apparently with NSR startup.  that stopped sometime before         
    november (2100A upgrade and v3.2g occurred in october).
                                                                                    
    i don't have console manager logs before november, i did check both             
    the 2100A and 8400 logs and Sunday's event was the only time since              
    november we've seen those messages on the console.                              
                                                                                    
    summary: not frequent, not recreatable at will, but ugly when it
    happens.       
                                                                                    
    > Have you explored a hardware problem at all                                   
                                                                                    
    the following all sort of imply it isn't hardware, at least to me:              
                                                                                    
      no errors were logged for either device.                                      
      the problem was temporarily recreatable (stopping and restarting              
        the nsr clone process stopped and recreated it).                            
      the problem stopped "by itself"                                               
        (perhaps with the nwadmin mount/dismount?                                   
         perhaps by the scu or mt commands i issued?).                              
      the problem occurred last July, different system, earlier OS release
        (tz87 fw of 971B or 9B3C).                                                  
      the cloning job when the errors were being generated seemed to be
        working fine (nsr reported both reading and writing... both tapes
        active).
                                                                                    
    however, that doesn't at all rule out the device being in some
    condition causing these messages that firmware could rectify.
                                                                                    
    the appearance is that cam_tape is reporting this condition when                
    it apparently isn't a problem, it could be "fixed" by simply                    
    commenting out the printf (but that could void some other situations            
    where it really is a problem that needs to be reported).                        
    i don't have the time to start reading cam_tape from scratch to                 
    understand under what conditions this message is displayed.
    somebody who is familiar with cam_tape may be able to tell us what              
    the conditions are to attempt to simulate it?                                   
                                                                                    
    i had posted the problem to alpha_osf_managers last July before                 
    opening the first SRQ, others had seen it but had no answers.                   
    any existing QAR's or tz87 firmware release notes which imply a                 
    resolution?                                                                     
                                                                                    
    in the event it isn't a known problem and we can't concoct how to               
    simulate it, any suggestions on what we might do for additional                 
    problem isolation (information gathering) when it occurs again                  
    (which may be another 6 months)?                                                
                                                                                    
    if the messages hadn't stopped after the scu|mt|re-mount i had                  
    planned on doing an scu device reset and then power reset of the                
    tz877's to see if that cleared the condition, but didn't have to.               
    in all likelihood something i did cleared the condition, i'm not                
    sure which, but...                                                              
                                                                                    
    The /nsr/logs/daemonlog shows:                                                  
                                                                                    
    Sat 16:30:26 /dev/nrmt0h label without mount operation in progress              
    nsrmmdbd: media db is saving its data, this may take a while                    
    nsrmmdbd: media db is open for business                                         
    nsrd: Sun Feb  2 07:25:31 1997                                                  
    nsrmmdbd: media db is cross checking the save sets                              
    nsrmmdbd: media db is open for business                                         
    Sun 07:32:23 /dev/nrmt0h mount operation in progress                            
    nsrd: Sun Feb  2 10:30:36 1997                                                  
    nsrmmdbd: media db is cross checking the save sets                              
    nsrmmdbd: media db is open for business                                         
    Sun 10:35:41 /dev/nrmt0h mount operation in progress                            
    Sun 10:43:20 /dev/nrmt0h mount operation in progress                            
    Sun 10:57:59 /dev/nrmt1h mount operation in progress                            
    Sun 10:59:54 /dev/nrmt0h mount operation in progress                            
    Sun 11:04:21 /dev/nrmt0h mount operation in progress                            
    Sun 11:09:33 /dev/nrmt1h mount operation in progress                            
    Sun 11:13:57 /dev/nrmt1h mount operation in progress                            
    Sun 11:18:12 /dev/nrmt1h mount operation in progress                            
    Sun 11:20:14 /dev/nrmt0h mount operation in progress                            
    Sun 16:00:45 /dev/nrmt0h unmount operation in progress                          
                                                                                    
    and /nsr/logs/messages shows:                                                   
                                                                                    
    Feb  2 07:25:31 glacier syslog: NetWorker Server: (notice) started
    Feb  2 07:26:10 glacier syslog: NetWorker Server: (info) Portions
    Copyright ) Digital Equipment Corporation 1995. All rights reserve
    Feb  2 07:26:21 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for sxclm.sois.alaska.edu
    Feb  2 07:26:22 glacier syslog: NetWorker index: (notice) nsrck is
    compressing index for sxclm.sois.alaska.edu
    Feb  2 07:26:25 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for nugget.alaska.edu
    Feb  2 07:26:25 glacier syslog: NetWorker index: (notice) nsrck is
    compressing index for nugget.alaska.edu
    Feb  2 07:26:28 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for glacier.alaska.edu
    Feb  2 07:26:35 glacier syslog: NetWorker index: (notice) nsrim has
    finished cross checking the media db
    Feb  2 07:27:01 glacier syslog: NetWorker index: (notice) nsrck has
    completed cross-check
    Feb  2 07:30:59 glacier syslog: NetWorker media: (info) tz87 tape
    Backups.031073 will be needed for a recover
    Feb  2 07:31:11 glacier syslog: NetWorker media: (info) suggest
    mounting Offback.030106 for backup to pool 'Offback'
    Feb  2 07:31:11 glacier syslog: NetWorker media: (waiting) backup to
    pool 'Offback' waiting for 1 writable backup tape
    Feb  2 07:31:13 glacier syslog: NetWorker Media: (info) loading volume
    Offback.030106 into /dev/nrmt0h
    Feb  2 10:30:37 glacier syslog: NetWorker Server: (notice) started
    Feb  2 10:31:15 glacier syslog: NetWorker Server: (info) Portions
    Copyright ) Digital Equipment Corporation 1995. All rights reserve
    Feb  2 10:31:25 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for sxclm.sois.alaska.edu
    Feb  2 10:31:26 glacier syslog: NetWorker index: (notice) nsrck is
    compressing index for sxclm.sois.alaska.edu
    Feb  2 10:31:27 glacier syslog: NetWorker media: (info) tz87 tape
    Offback.030106 was being written before crash
    Feb  2 10:31:30 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for nugget.alaska.edu
    Feb  2 10:31:31 glacier syslog: NetWorker index: (notice) nsrck is
    compressing index for nugget.alaska.edu
    Feb  2 10:31:32 glacier syslog: NetWorker index: (notice) nsrim has
    finished cross checking the media db
    Feb  2 10:31:33 glacier syslog: NetWorker index: (notice) nsrck is
    cross-checking index for glacier.alaska.edu
    Feb  2 10:31:46 glacier syslog: NetWorker media: (info) read 395
    records in Backups.031073 into /dev/nrmt1h
    Feb  2 11:19:04 glacier syslog: NetWorker media: (info) suggest
    mounting Offback.030109 for backup to pool 'Offback'
    Feb  2 11:19:04 glacier syslog: NetWorker media: (waiting) backup to
    pool 'Offback' waiting for 1 writable backup tape
    Feb  2 11:19:05 glacier syslog: NetWorker Media: (info) loading volume
    Offback.030109 into /dev/nrmt0h
                                                                                    
    The system was rebooted ~7:25.  NSR cloning was started after that.             
    The system panic'd and restarted around 10:30 and that's when the               
    ctape_strat messages started.  According to the nsr daemonlog,                  
    only mt0 was re-mounted by nsr prior to 10:57.  My speculation on
    Sunday was that only the dismount|mount of mt1 generated the                    
    messages (see the first work order on the SRQ).                                 
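
As a cross-check of that reading of the daemonlog, the first post-restart mount per device can be pulled out mechanically. This is only a minimal sketch: the sample lines are copied from the daemonlog excerpt above, while the regular expression and the function name `first_mounts` are illustrative assumptions, not part of NSR.

```python
import re

# Lines copied from the /nsr/logs/daemonlog excerpt (after the ~10:30 restart).
DAEMONLOG = """\
Sun 10:35:41 /dev/nrmt0h mount operation in progress
Sun 10:43:20 /dev/nrmt0h mount operation in progress
Sun 10:57:59 /dev/nrmt1h mount operation in progress
Sun 10:59:54 /dev/nrmt0h mount operation in progress
"""

# Assumed line layout: "<day> <hh:mm:ss> <device> <mount|unmount> operation in progress"
PATTERN = re.compile(
    r"^(\w{3}) (\d\d:\d\d:\d\d) (/dev/\S+) (mount|unmount) operation in progress$"
)


def first_mounts(log_text):
    """Return {device: time of its first 'mount' line} for the given log text."""
    first = {}
    for line in log_text.splitlines():
        m = PATTERN.match(line.strip())
        if m and m.group(4) == "mount":
            # setdefault keeps the earliest entry seen for each device.
            first.setdefault(m.group(3), m.group(2))
    return first


if __name__ == "__main__":
    for device, when in first_mounts(DAEMONLOG).items():
        print(device, when)
```

On the sample lines this reports mt0 first mounted at 10:35:41 and mt1 not until 10:57:59, which matches the timeline described above.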
    
    
    Here's my speculation based on all the above information:                       
                                                                                    
      The problem is triggered by a tape remaining mounted across a panic           
        (or possibly any reboot).                                                   
      It was mt1 generating the messages, that was the read half of the
        clone.
      The manual remount of mt1 @10:57 cleared the condition causing the
        messages.
      It's possible mt0 had the condition as well, but NSR's remount of the         
        device with start-up cleared that.                                          
                                                                                    
    Can you follow my reasoning on this... if not we may wish to switch             
    to phone (verbal)... (907)474-6266.  If so, here's what I recommend:            
                                                                                    
      You have somebody confirm that a mount would likely clear the
        condition (it's a good bet).
      I add dismount + remount to our local nsr startup procedures                  
        (this should prevent us from getting in the infinite looping                
        condition ever again).                                                      
      You QAR the problem for either general change to NSR start procedures         
        to always re-mount on startup, and|or cam_tape fixing this                  
        situation.  Note, we can likely prevent the problem with NSR...             
        but anything else could encounter it.                                       
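
The local startup change proposed in the list above (dismount, then remount, every tape device) could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested procedure: `remount_all`, `unmount_cmd`, `mount_cmd`, and the `site-unmount` / `site-mount` command names are hypothetical placeholders, and the real NSR unmount and mount invocations would have to be taken from the site's own documentation.

```python
def remount_all(devices, unmount_cmd, mount_cmd, run):
    """Unmount, then mount, each device; return the devices whose mount failed.

    unmount_cmd / mount_cmd: callables mapping a device path to an argv list
        (site-specific; supplied by the caller, nothing is assumed here).
    run: callable taking an argv list and returning an exit status,
        for example subprocess.call.
    """
    failed = []
    for dev in devices:
        # The unmount status is ignored on purpose: after a panic the
        # device may have no volume mounted as far as NSR is concerned.
        run(unmount_cmd(dev))
        if run(mount_cmd(dev)) != 0:
            failed.append(dev)
    return failed


if __name__ == "__main__":
    # Dry run: print what would be executed instead of running anything.
    def dry_run(argv):
        print(" ".join(argv))
        return 0

    remount_all(
        ["/dev/nrmt0h", "/dev/nrmt1h"],
        lambda dev: ["site-unmount", dev],  # hypothetical placeholder command
        lambda dev: ["site-mount", dev],    # hypothetical placeholder command
        dry_run,
    )
```

Passing the runner in as a parameter keeps the sketch testable without touching a real tape device.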
                                                                                    
    I think I may have just isolated the problem, what do you think?                
    Might even be recreatable if somebody can invest the effort.                    
    kurt                                                                            
    _____________________________________________________________________           
    Kurt Carlson,      University of Alaska SOIS/TS,        (907)474-6266           
    [email protected]   910 Yukon Drive #105.63, Fairbanks,  AK 99775-6200           
    __________________________________________________________________________

T.R	Title	User	Personal Name	Date	Lines
8778.1	Can we get context here?	DECWET::FARLEE	Insufficient Virtual um...er....	`Tue Feb 11 1997 19:05`	16
    Greetings.  I'm in the NetWorker engineering group, and I'm trying to
    make sense of this...

    It appears as if this is one entry in an ongoing problem report, but
    we don't have the full context:

    There are references to "the ctape_strat messages", but not the full
    message text.
    There are references to "the big ugly loop", but not a description
    (that I can see) of what is looping, and what the symptoms are.

    If you can provide a bit more context, perhaps I can help from an
    application front.

    Kevin Farlee
8778.2		DECWET::RWALKER	Roger Walker - Media Changers	`Wed Feb 12 1997 16:10`	10
    I would expect that these are the "Read density invalid" errors
    reported by the tape driver in function ctape_strat, normally at
    system startup.

    This was QARed prior to the release of DIGITAL UNIX 4.0, but by the
    time the tape driver developer could get to the problem, we could not
    replicate it and neither could they.  If UEG can be provided with a
    set of steps to reproduce the problem reliably, I'm sure they would
    be happy to fix it.
8778.3	ctape_strategy messages	FRAIS::KHAN		`Thu Feb 20 1997 11:36`	12
    Yes, we also have a customer getting these messages during startup.
    The sequence is like:

    ...vmunix: Starting CPU...
    ...vmunix: SuerLAT. C......
    ...vmunix: fta0: Link...
    ...vmunix: ctape_strategy: READ case and density info not valid
    ...vmunix: ctape_strategy: READ case and density info not valid
    ...

    The customer thinks that the tapes are slower than before.

    /Azfar
8778.4	after tape added ?	FRAIS::KHAN		`Thu Feb 20 1997 11:54`	8
    I talked to the customer and got some more information.

    When I asked when the messages started, he said he thinks they began
    when the second tape drive (TZL07) was added.  The first tape drive
    is a TK87 (same bus).  He did not build a new kernel, but just
    MAKEDEV'd it.  (Maybe after a kernel rebuild the message disappears.)
    But at present we get the message on reboot.

    Does anyone have any troubleshooting ideas?

    /Azfar
8778.5		NETRIX::"[email protected]"	Jan Reimers	`Thu Feb 20 1997 22:00`	41
    The ctape_strategy messages are coming from the CAM tape driver.  The
    message is a warning that something has not been set up correctly for
    this particular tape drive inside the tape driver.

    This could be an indication that the tape drive is not responding
    correctly to a MODE SENSE command, which is used to determine, among
    other things, the density at which the tape has been written.  It is
    unlikely that this is the cause for the message, however.  More
    likely is that there is some other application which is "stealing"
    the UNIT ATTN signal from the tape driver.

    I think what is happening here is that some other application is
    accessing the tape drive through the user agent (/dev/cam) before the
    tape driver has opened the device for the first time.  The tape
    driver will normally do a MODE SENSE if the tape is either at BOM or
    has the UNIT ATTN set.  According to the SCSI-2 standard, the tape
    drive will hold the UNIT ATTN for that initiator until the first
    command, at which time the UNIT ATTN is released.  If an application
    accesses the tape drive through the user agent before the first open
    with the tape driver, then the UNIT ATTN is no longer available to
    the tape driver, and it will not know to do the MODE SENSE and set up
    the density.

    Are these messages seen ONLY at startup time?  Or are they seen at
    other times when the tape is being read?  If only at startup time,
    can they be reproduced if the tape is ejected, reinserted, and
    whatever application is accessing the tape drive through /dev/cam is
    started?

    If this is the correct scenario, a possible workaround would be to
    open the tape drive through the tape driver immediately after a tape
    has been inserted.  If that makes the messages disappear, then we
    need to figure out what is "stealing" the UNIT ATTNs.

    FWIW, this is only a warning.  The tape will still be read correctly
    because the tape driver will tell the tape drive to use its default
    density, which will cause the tape drive to read at whatever density
    the tape was written at.

    Jan Reimers
    [Posted by WWW Notes gateway]
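
The "stolen UNIT ATTN" scenario described in 8778.5 can be illustrated with a toy model. This is a deliberately simplified sketch, not the CAM tape driver's logic: the classes `ToyTapeDrive` and `ToyTapeDriver` are invented for illustration, a single initiator is assumed, the BOM case is left out, and the warning string is simply borrowed from the console output quoted in 8778.3.

```python
class ToyTapeDrive:
    """Toy drive: UNIT ATTENTION stays pending until the first command arrives."""

    def __init__(self):
        self.unit_attention = True  # set by media insertion or reset

    def command(self, name):
        """Accept a command; return whether UNIT ATTENTION was pending, then clear it."""
        was_pending = self.unit_attention
        self.unit_attention = False
        return was_pending


class ToyTapeDriver:
    """Toy driver: only learns the density if it sees the UNIT ATTENTION itself."""

    def __init__(self, drive):
        self.drive = drive
        self.density_valid = False

    def open(self):
        if self.drive.command("TEST UNIT READY"):
            self.drive.command("MODE SENSE")  # would fetch the density here
            self.density_valid = True

    def read(self):
        if not self.density_valid:
            return "warning: READ case and density info not valid"
        return "ok"


def scenario(user_agent_first):
    """Run one insert/open/read cycle, optionally with another program going first."""
    drive = ToyTapeDrive()
    if user_agent_first:
        # Some other program reaches the drive (e.g. through /dev/cam)
        # before the tape driver's first open and consumes the UNIT ATTENTION.
        drive.command("TEST UNIT READY")
    driver = ToyTapeDriver(drive)
    driver.open()
    return driver.read()


if __name__ == "__main__":
    print("driver opens first:    ", scenario(user_agent_first=False))
    print("user agent goes first: ", scenario(user_agent_first=True))
```

In this model the warning appears only when something else talks to the drive before the driver's first open, which is also why opening the device through the tape driver right after a tape is inserted would be expected to avoid it.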