Conference mvblab::alphaserver_4100

Title: AlphaServer 4100
Moderator: MOVMON::DAVISS
Created: Tue Apr 16 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 648
Total number of notes: 3158

500.0. ""invalid memory read access from kernel mode" 466MHZ cpu" by TROOA::HANDY (L. Handy, MCSE DTN 626-3210) Tue Feb 18 1997 16:29

    We are encountering a problem after upgrading our 4100 to 2x466MHz
    cpu's from a 4x400MHz configuration.  In both configurations, 
    we have 4GB memory.  2GB memory was attempted with no effect.
    
    During the Unix v3.2G boot, we get the following errors:
    'kernel argument'
    'trap:'
    'invalid memory read'
    'access from kernel mode'
    'faulting address = 00000000000000018'
    ...
    ...
    panic cpu
    ...dump...
    
    The only other change was that we have upgraded the console SRM to 
    v4.8-5 per the recent BLITZ.  This firmware was provided on a floppy
    that was shipped with the cpu boards.
    
    Note that we CANNOT boot to single user mode or genvmunix.  System
    down.
    
    Please recommend workarounds/fixes.  Your support is appreciated.
    
    Is this a known problem?
    
    Lyndon
T.R  Title  User  Personal Name  Date  Lines
500.1  "invalid memory read access from kernel mode"  CSC32::HUTMACHER  Tue Feb 18 1997 17:17  19
    hi Lyndon
    
    i think when you upgraded the console version it introduced
    a new console param memory_test, and even though it's supposed to
    default to FULL, i have seen new firmware load with it set to
    Partial
    
    suggest checking this param
    
    >>>show memory_test      if set to None or Partial this is the problem
    >>>set memory_test Full
    >>>init
    
    then try and boot unix, hopefully that will take care of it.
    
    similar note string in note 403
    
    jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
                           
500.2  HARMNY::CUMMINS  Tue Feb 18 1997 17:34  28
    Reply .1 is basically correct.
    
    The MEMORY_TEST environment variable has existed since Day One. The
    default has always been FULL. Users may opt to override the default to
    reduce power-up test times. However, OpenVMS has never supported
    MEMORY_TEST=PARTIAL/NONE. And Digital UNIX used to support it when
    pre-V3.0-10 SRM consoles marked untested memory pages GOOD, but now no
    longer officially supports MEMORY_TEST=PARTIAL/NONE due to what is
    assumed to be a VMUNIX bug on certain memory configs with V3.0-10
    consoles (and beyond). [We made a change to mark untested memory as
    BAD in the memory bitmaps we pass to VMS/UNIX per request of the UNIX
    group - previously, we were not being compliant with the Alpha SRM.]
    
    Bottom line, UNIX and VMS customers should set MEMORY_TEST=FULL, at
    least until UNIX fixes its (apparent) VMUNIX bug and/or VMS adds
    support for partially tested memory.
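
The support rules described above can be written down as a small lookup. This is only a sketch in Python, not anything the console or either operating system provides: the function names are hypothetical, the version parsing assumes the "Vx.y-z" form used in this note, and the rule encoded is just what this reply states (OpenVMS: FULL only; Digital UNIX: PARTIAL/NONE only with consoles older than V3.0-10).

```python
def parse_console_version(text):
    """Turn a console version string such as 'V3.0-10' into (3, 0, 10)."""
    major_minor, _, patch = text.lstrip("Vv").partition("-")
    major, minor = major_minor.split(".")
    return (int(major), int(minor), int(patch or 0))


def memory_test_supported(os_name, console_version, memory_test):
    """Return True if the MEMORY_TEST setting is supported per reply .2.

    Hypothetical helper: it encodes only the rules stated in this note.
    """
    if memory_test.upper() == "FULL":
        return True  # FULL is the default and is always supported
    name = os_name.lower()
    if name == "openvms":
        return False  # OpenVMS has never supported PARTIAL/NONE
    if name == "digital unix":
        # PARTIAL/NONE was supported only while pre-V3.0-10 consoles
        # marked untested pages GOOD.
        return parse_console_version(console_version) < (3, 0, 10)
    raise ValueError("unknown operating system: " + os_name)


if __name__ == "__main__":
    for os_name in ("Digital UNIX", "OpenVMS"):
        for setting in ("FULL", "PARTIAL", "NONE"):
            ok = memory_test_supported(os_name, "V4.8-5", setting)
            print(os_name, "V4.8-5", setting, "supported" if ok else "NOT supported")
```

With a V4.8-5 console, only FULL comes back as supported for either operating system, which matches the bottom line above.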
    
    I would guess like Jim did that your system has MEMORY_TEST != FULL.
    Setting it to FULL, resetting the machine, and then booting will
    hopefully fix the problem.
    
    The V4.8-5 console is required for various options, including B3004-BA
    and B3004-DA. I.e. there's no going back to pre-V3.0 consoles for those
    customers who want reduced memory test times *and* support for these
    new options and others (e.g. the expanded I/O option for the 4000).
    
    Let me know if questions,
    BC
500.3  Firmware downgrade to v3.7, switched cpu's back  TROOA::HANDY  L. Handy, MCSE DTN 626-3210  Tue Feb 18 1997 18:11  10
    We were able to workaround the problem by DOWNGRADING THE FIRMWARE, and
    re-installing the original 400MHz cpu's.  The firmware we used was from
    the v3.7 firmware cdrom.  Note that v3.8 yielded the same problems.
    
    Could someone post a firmware versus cpu versus hardware matrix that
    could simplify all of this?  According to the documentation available, 
    Unix 3.2G would function normally on a 4100 with 466MHz cpu's- The
    problem was with the required firmware upgrade!
    
    LKH
500.4  HARMNY::CUMMINS  Tue Feb 18 1997 18:45  16
    Did you read note 500.2 and try setting MEMORY_TEST=FULL?
    
    The 466 MHz CPU is only supported by V4.8-5 console. This info was
    buried in note 500.2 as well.
    
    Finally, assuming Jim Hutmacher and I are correct that your problem is
    that MEMORY_TEST is set other than FULL, this is an unfortunate set of
    circumstances in that UNIX initially stated support for MEMORY_TEST set
    other than FULL, but after we changed per their request, a latent UNIX
    VM bug (presumably) was found and has yet to be fixed in this area.
    
    Please reply with an indication of how the system's MEMORY_TEST EV has
    been set.
    
    Thanks,
    BC
500.5  Reduce guesswork  NETRIX::"[email protected]"  Dave Cherkus  Wed Feb 19 1997 10:05  35
It could also have nothing to do with memory.

All the panic indicates is some kernel software tried to read memory
location 0x18, which should never happen because there is no memory
mapped at that address in kernel mode, ever, regardless of what
memory the console tested, etc.

Now that the system is up, presuming you used saved vmunix,
why don't you take the PC and RA values that are in the same set
of messages (i.e. the lines right after the one reporting the
access to location 18) and look them up with the debugger?

For example, suppose the message looks like:

trap: invalid memory read access from kernel mode

    faulting virtual address:     0x0000000000000010
    pc of faulting instruction:   0xfffffc00004fbdbc
    ra contents at time of fault: 0xfffffc00004fbdb8
    sp contents at time of fault: 0xffffffff8fceb958


You should be able to do:
# dbx vmunix
(dbx) 0xfffffc00004fbdbc/i
  [rmerror_failover:1283, 0xfffffc00004fbdbc]   ldq     s3, 16(s3)
(dbx) 0xfffffc00004fbdb8/i
  [rmerror_failover:1274, 0xfffffc00004fbdb8]   stl     s5, 76(sp)

In my case, this tells me that there is a problem in the rm failover
code.  In your case it may tell you about some other problem, and
that problem could have nothing to do with the amount of memory
that the console tested, etc.
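
To make the lookup above less error-prone when copying long addresses by hand, the trap text can be parsed and turned into the dbx requests mechanically. This is a sketch in Python under the assumption that the panic output has the "name: 0x..." layout of my example; the function names are made up for illustration, and it only prints the dbx input lines, it does not run dbx.

```python
import re

# Sample trap text, copied from the example message above.
SAMPLE = """\
trap: invalid memory read access from kernel mode

    faulting virtual address:     0x0000000000000010
    pc of faulting instruction:   0xfffffc00004fbdbc
    ra contents at time of fault: 0xfffffc00004fbdb8
    sp contents at time of fault: 0xffffffff8fceb958
"""

# One "label: 0xHEX" pair per line; other lines are ignored.
TRAP_FIELD = re.compile(
    r"^\s*(?P<name>[a-z][a-z ]*?):\s+(?P<value>0x[0-9a-fA-F]+)\s*$"
)


def parse_trap(text):
    """Return a dict mapping each label in the trap text to its integer value."""
    fields = {}
    for line in text.splitlines():
        match = TRAP_FIELD.match(line)
        if match:
            fields[match.group("name")] = int(match.group("value"), 16)
    return fields


def dbx_lookups(fields):
    """Build the '<address>/i' lines to type at the (dbx) prompt for PC and RA."""
    wanted = ("pc of faulting instruction", "ra contents at time of fault")
    return ["0x%016x/i" % fields[key] for key in wanted if key in fields]


if __name__ == "__main__":
    for command in dbx_lookups(parse_trap(SAMPLE)):
        print(command)
```

Paste the printed lines at the (dbx) prompt after starting dbx on the same vmunix that panicked; a different kernel image will map the addresses to different code.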

[Posted by WWW Notes gateway]
500.6  firmware upgrade was the problem, memtest=partial  TROOA::HANDY  L. Handy, MCSE DTN 626-3210  Wed Feb 19 1997 13:07  21
    Thanks for the prompt replies.
        
    The firmware upgrade procedure automatically changed 
    memtest to "partial", and according to .1 and .2, this is the
    root cause of our system panic.  
    
    We did have memtest set to NONE before the upgrade, and I would have no
    problem with setting it to FULL had it been documented.  Better yet,
    the firmware upgrade should set this as default.
        
    Note booting the Unix CD, or replacing the 466's with 400's 
    yielded exactly the same results. This is how we isolated the problem to 
    the firmware upgrade.  The downgrade of firmware and cpu's was an acceptable 
    workaround for us- no extra steps were required.
        
    No doubt other customers will be impacted by this- I would strongly 
    recommend that we fix the firmware upgrade
    procedure, and document this wonderful 'feature' and distribute it
    via a BLITZ.
    
    LKH
500.7  MAY30::CUMMINS  Wed Feb 19 1997 15:57  74
From:	HARMNY::CUMMINS      "Bill Cummins, PKO3-2/Q21, 223-4641" 19-FEB-1997 15:45:02.73
To:	TROOA::HANDY
CC:	GENT,SAVAGE,CUMMINS
Subj:	Re: Please read replies #2 and #4 in ALPHASERVER_4100

Re: the problems you reported in ALPHASERVER_4100 note 500.*.

Hopefully you saw my reply note in 500.2 which describes how some of the
confusion over use of the MEMORY_TEST on Digital UNIX based systems came about.

Please note that our firmware readme files and release notes do document this
issue. See attached. We could only document what we knew at the time these
readme and release notes were issued, but if you look at them, you'll see we
*definitely do* tell the user that there are support issues. I believe I was
told by my Digital UNIX counterpart that issues with the AlphaServer 4000/4100
console's MEMORY_TEST EV are similarly documented and release noted, though I
have not verified this first-hand.

How did you attempt to update your firmware? Presumably using LFU? Via network?
Floppy? If network, by booting a single file via the console BOOT command or by
typing LFU at the console prompt and answering questions? Our readme file is
typically pretty hard to miss during most update procedures. That is, unless
the user opts to bypass it and/or ignore it.

I can sympathize with your frustration over this problem. Like you, we get
similarly frustrated by the apparent large number of users/customers who update
firmware without reading readme or release notes. But then, I don't read my VCR
manual when I buy a VCR either. Still, I wouldn't call and complain to my VCR
manufacturer if I did something to my VCR that I later found my manual told me
not to do.

If you have a suggestion for how we can better serve our customers in terms of
warning them about issues like this, I'd like to hear your ideas.

I've extracted bulleted items from our V3.0-10 and V4.8-5 LFU update readme
files. See attached readme file snippets.

Thanks and best regards,
BC

I extracted these bullets from the V3.0-10 LFU firmware readme file, under
the "changes since last release" and "issues" headings, respectively:

    --> Warn user when memory has gone untested due to MEMORY_TEST environment
        variable not being set FULL. Print estimated test time for XSROM T24 if
        console=serial and memory_test=full.

    --> Mark untested memory pages (memory_test=partial/none) as untested
        rather than good in bitmaps passed to Digital UNIX (and OpenVMS, even
        though not yet supported). Note: see UNIX release notes for list of
        UNIX versions and/or patches which support partial memory test.

I extracted this bulleted item from the V4.8-5 LFU firmware readme file:

    --> If running OpenVMS or Digital UNIX, the MEMORY_TEST environment variable
        must be set to FULL (default). PARTIAL and NONE are not supported.

From:	TROOA::HANDY        "Lyndon Handy- Americas Benchmark Program" 19-FEB-1997 12:29:05.45
To:	HARMNY::CUMMINS
CC:	
Subj:	RE: Please read replies #2 and #4 in ALPHASERVER_4100

Bill:

No, we have not had the opportunity to try the 466's again.  We lost the 
opportunity to use them in a benchmark with customers present.  The firmware
setting problem was very frustrating and lost us that opportunity.  We had 
to downgrade the firmware and replace the 400MHz cpu's.  This does not reflect
well on the installation process.  This 'feature' should have been documented
well, and issued in a blitz.  I spoke to Ted yesterday about this.

We are benchmarking 'Digital' here, not just system performance.

Lyndon
500.8  MAY30::CUMMINS  Wed Feb 19 1997 16:26  70
    We will be issuing a blitz on this matter. I have also entered note
    #503 in this same notes conference which will hopefully serve as a
    more visible alert of the problem.
    
    Bottom line: if we had known about the UNIX problem with MEMORY_TEST
    from the very start, we almost certainly would have implemented things
    differently. The problem has been on again / off again support and
    changes in behavior / implementation over time in this particular area.
    We therefore tried to deal with the issue via FW readme files and FW
    release notes. The bulk of all customers update via CD/ROM and the CD
    automatically dumps the readme file to the screen prior to the update.
    The user can disregard the readme, but what would you recommend we do
    about this?
    
    For the record, and re: reply .6
    
    I'm sorry, but your statement about FW changing the MEMORY_TEST EV
    settings is simply not correct. Firmware never modifies the setting of
    the MEMORY_TEST EV unless told to do so by the user. It always uses the
    value specified by the customer (via the SET command). The default, if
    not modified by the user, is FULL. All 4100/4000 boxes are shipped from
    Manufacturing with a setting of MEMORY_TEST=FULL.
    
    The problem was that UNIX requested we change the behavior of console
    as of the V3.0-10 release (in the area of partial memory testing). And
    we did so. Both of us then tested the change, and found no problems on
    the memory/system configs we tested.
    
    We were unaware until after V3.0 console was released that all versions
    of Digital UNIX have a bug that results in panics during booting
    on 4100/4000 systems with certain memory configurations and partially
    tested memory. All configs we (UNIX group and our FW qual team) tested
    succeeded during our qual process. It was only relatively recently
    discovered that UNIX had an issue/bug with settings other than FULL.
    UNIX is working on a fix for this bug. Also, I just spoke via mail with
    my UNIX counterpart and he agrees that we should issue a Blitz on the
    matter. He expects the problem to be fixed in the next UNIX release.
    
    Note that there have been notes in this notes conference that talked
    about UNIX panics with MEMORY_TEST set other than FULL.
    
    Also note that the V3.0-10 and V4.8-5 readme and release notes
    absolutely *do* discuss issues with setting MEMORY_TEST other than FULL
    on UNIX and OpenVMS systems (based on what was known at the time the
    readmes/notes were written). The V3.0-10 readme refers to issues with
    settings other than FULL in the context of UNIX and memory pages not
    ever being tested. I.e. pre-V3.0 console behavior versus V3.0 behavior
    and beyond. The V4.8-5 readme/notes discuss it in terms of telling the 
    user to not set it other than FULL because we knew about the UNIX bug
    at the time we released V4.8-5. See the end of reply .7 for the text
    from the readme file.
    
    The FW upgrade does not set the MEMORY_TEST EV by default/automatically
    or query the user for the appropriate setting because the console
    doesn't know which operating system(s) the customer may want to boot.
    Additionally, the LFU update utility is built into the console being
    updated - not the console being updated to.. And, operating system
    support for MEMORY_TEST less than FULL is planned to be phased in over
    time. FW cannot check for operating system version at boot time, etc.
    
    Sure, we could prompt the user during the update for EV settings, but
    we felt that a note in the readme and release notes would be adequate.
    And two orders of magnitude simpler to implement at that. As with UNIX
    or VMS, there are multiple items that can be modified by the user which
    may lead to improper system behavior, even crashes.
    
    I admit that ideally the update utility would have built in warnings
    and questions for all items that could be set by the user that might
    negatively impact system behavior. This may be a direction we pursue
    in the future.
500.9  firmware was provided with cpu's on floppy  TROOA::HANDY  L. Handy, MCSE DTN 626-3210  Wed Feb 19 1997 23:20  12
    The cpu's were shipped with the 4.8-5 firmware on floppy diskette. 
    According to our MCS technician, the documentation accompanying the 
    466 cpu boards indicated that required firmware was already pre-loaded.  
    However, after installing the cpu's, the tech. determined that this was 
    not the case, and followed the documented LFU-type installation 
    instructions.  The most recent blitz document was also reviewed to
    double-check.
    
    I will inquire about the release notes, and whether or not they were
    included with the firmware in hardcopy, or on the floppy diskette.
    
    LKH
500.10  Update on firmware kit on floppy  TROOA::HANDY  L. Handy, MCSE DTN 626-3210  Thu Feb 20 1997 00:46  30
    Also, re: .8:
    
    Thanks for all your support- here is an update.  
    
    1) I checked the package that was shipped with the 466 cpu's:
    
    There is a cpu installation card, floppies (console and OpenVMS upgrade), 
    plus a "Dear Customer" cover memo, with attached
    "Console/AlphaBIOS update instructions" in hardcopy.  I verified that
    the MCS tech. followed these instructions, using the LFU utility. 
    There is no mention of the MEMORY_TEST setting in hardcopy, however,
    there is mention of it in the on-disk readme file:
    
    "5. Firmware Anomalies, Restrictions, and Workarounds
    
        --> If running OpenVMS or Digital UNIX, the MEMORY_TEST environment
    variable must be set to FULL (default). PARTIAL and NONE are not
    supported."
      
    The 'panic' condition is not mentioned here.  
    
    2) Note that the MEMORY_TEST parameter was not modified using the 
    SET command at the 4100 console.  After the firmware was upgraded to 
    v4.8-5, we reset the bootdef_dev and bootos_flags parameters 
    accordingly, attempted to boot unix and 
    encountered the panic problem as indicated in .0 .
    
    Yes, I agree a BLITZ will help, and perhaps more explicit wording in
    the readme file, ie. 'mandatory' setting of MEMORY_TEST parameter to
    FULL.
500.11  dir/title=trap  VIRGIN::GLAUS  Feel like Don Quijote...  Thu Feb 20 1997 03:29  5
    re .0
    
    A dir/tit=trap etc. would have helped you faster (eg see note 403) 
    
    Guido.
500.12  AlphaBIOS CMOS Set-up SAVE/EXIT changes MEMORY_TEST EV  HARMNY::CUMMINS  Fri Feb 28 1997 18:31  70
    Lyndon,

    The SRM console and AlphaBIOS more or less share the MEMORY_TEST
    environment variable (stored in TOY NVRAM). Other notes in this
    conference provide more detail. What's important here, is that we
    may have discovered why your system seemed to have the MEMORY_TEST
    EV setting change out from under you..
    
    See attachment. Will post more details as they become known..
    
    BC
    
From:	HARMNY::CUMMINS      "Bill Cummins, PKO3-2/Q21, 223-4641" 28-FEB-1997 18:18:51.91
To:	SALEM::HOBBS,SALEM::MCGAR,FATSYS::SIMON,AWAKEN::EWHITE,ZGOST2::SHTAN
CC:	CUMMINS
Subj:	MEMORY_TEST environment variable and AlphaBIOS CMOS set-up menu SAVE/EXIT

AlphaBIOS will change the MEMORY_TEST EV setting (shared between SRM and
AlphaBIOS consoles) when one saves/exits from the AlphaBIOS CMOS set-up,
even when the memory test enabled/disabled and partial/full menu items are
not touched or altered in any way!

Could you check your process to make sure that MEMORY_TEST=FULL is checked
*after* any CMOS or Advanced CMOS set-ups might be done during the Stage
II config/test process? I'm guessing all systems are leaving MFG with
MEMORY_TEST=FULL and that customers are changing the variable either
purposefully or inadvertently (AlphaBIOS "feature"), but thought I'd check
with you just in case..

Let me know.
Thanks,
BC


From:	HARMNY::CUMMINS      "Bill Cummins, PKO3-2/Q21, 223-4641" 28-FEB-1997 18:08:47.03
To:	OLEUM::BUCHMAN
CC:	GENT,CSC32::HUTMACHER,CUMMINS
Subj:	MEMORY_TEST environment variable and AlphaBIOS CMOS set-up menu SAVE/EXIT

Hi Matt,

We've been seeing a lot of problems in the Field on UNIX and VMS
machines relative to the SRM console's MEMORY_TEST EV. UNIX believes it
has a VM bug which causes UNIX to panic if MEMORY_TEST is not set FULL
and console is V3.0-10 or greater (on certain memory configs). VMS has
never supported a PARTIAL/NONE setting.

Until just recently, we could not figure out why so many customers were
being impacted. Manufacturing has guaranteed me that they set the SRM
MEMORY_TEST EV to FULL by default. We recently discovered that AlphaBIOS
will reset the shared MEMORY_TEST EV NVRAM location after SAVING/EXITING
from the CMOS set-up menu. I've seen a SAVE/EXIT following an unrelated
CMOS parameter change result in MEMORY_TEST being set NONE and I've also
seen it result in a PARTIAL setting. And I wasn't even in the Advanced
CMOS Set-up menu!

There are two reasons for my sending you this mail:

 1) Can you write up exactly how AlphaBIOS works w.r.t. this EV? If
    different versions of AlphaBIOS behave differently here, then I'd
    also like to have that information.
 2) I strongly request/encourage you to make the AlphaBIOS default
    memory test settings ENABLED and FULL in the next release. Would
    you be willing to do so?

A quick reply with the info we need would be greatly appreciated as I'd
like to get an updated Blitz to the Field as quickly as possible.

Thanks,
BC