T.R | Title | User | Personal Name | Date | Lines |
---|
500.1 | "invalid memory read access from kernel mode" | CSC32::HUTMACHER | | Tue Feb 18 1997 17:17 | 19 |
| hi Lyndon
i think when you upgraded the console version it introduced
a new console param memory_test, and even though it's supposed to
default to FULL, i have seen new firmware load with it set to
Partial.
suggest you check this param; if it is set to None or Partial, this is
the problem:
>>>show memory_test
>>>set memory_test Full
>>>init
then try to boot unix; hopefully that will take care of it.
similar note string in note 403
jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
|
500.2 | | HARMNY::CUMMINS | | Tue Feb 18 1997 17:34 | 28 |
| Reply .1 is basically correct.
The MEMORY_TEST environment variable has existed since Day One. The
default has always been FULL. Users may opt to override the default to
reduce power-up test times. However, OpenVMS has never supported
MEMORY_TEST=PARTIAL/NONE. And Digital UNIX used to support it when
pre-V3.0-10 SRM consoles marked untested memory pages GOOD, but now no
longer officially supports MEMORY_TEST=PARTIAL/NONE due to what is
assumed to be a VMUNIX bug on certain memory configs with V3.0-10
consoles (and beyond). [We made a change to mark untested memory as
BAD in the memory bitmaps we pass to VMS/UNIX per request of the UNIX
group - previously, we were not being compliant with the Alpha SRM.]
Bottom line, UNIX and VMS customers should set MEMORY_TEST=FULL, at
least until UNIX fixes its (apparent) VMUNIX bug and/or VMS adds
support for partially tested memory.
I would guess like Jim did that your system has MEMORY_TEST != FULL.
Setting it to FULL, resetting the machine, and then booting will
hopefully fix the problem.
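The support rule described in this reply can be written down as a small check. This is only a sketch in Python of what the reply states, not an official support matrix. The function names `parse_console_version` and `memory_test_supported` are made up for illustration, and the parsing assumes console versions are written in the form V3.0-10 or V4.8-5.

```python
import re

VALID_SETTINGS = {"FULL", "PARTIAL", "NONE"}

def parse_console_version(text):
    """Turn a string such as 'V4.8-5' into a comparable tuple (4, 8, 5)."""
    m = re.fullmatch(r"[Vv](\d+)\.(\d+)(?:-(\d+))?", text.strip())
    if m is None:
        raise ValueError(f"unrecognized console version: {text!r}")
    major, minor, rev = m.groups()
    return (int(major), int(minor), int(rev or 0))

def memory_test_supported(os_name, console_version, memory_test):
    """Return True if the MEMORY_TEST setting is supported per this reply."""
    setting = memory_test.strip().upper()
    if setting not in VALID_SETTINGS:
        raise ValueError(f"unknown MEMORY_TEST setting: {memory_test!r}")
    if setting == "FULL":
        return True  # FULL is the default and is supported everywhere
    os_key = os_name.strip().lower()
    if os_key == "openvms":
        return False  # OpenVMS has never supported PARTIAL/NONE
    if os_key == "digital unix":
        # PARTIAL/NONE worked only with pre-V3.0-10 consoles, which
        # marked untested pages GOOD.
        return parse_console_version(console_version) < (3, 0, 10)
    raise ValueError(f"unknown operating system: {os_name!r}")
```

For example, `memory_test_supported("Digital UNIX", "V4.8-5", "Partial")` comes back False, which matches the case suspected in this note string.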
The V4.8-5 console is required for various options, including B3004-BA
and B3004-DA. I.e. there's no going back to pre-V3.0 consoles for those
customers who want reduced memory test times *and* support for these
new options and others (e.g. the expanded I/O option for the 4000).
Let me know if questions,
BC
|
500.3 | Firmware downgrade to v3.7, switched cpu's back | TROOA::HANDY | L. Handy, MCSE DTN 626-3210 | Tue Feb 18 1997 18:11 | 10 |
| We were able to workaround the problem by DOWNGRADING THE FIRMWARE, and
re-installing the original 400MHz cpu's. The firmware we used was from
the v3.7 firmware cdrom. Note that v3.8 yielded the same problems.
Could someone post a firmware versus cpu versus hardware matrix that
could simplify all of this? According to the documentation available,
Unix 3.2G would function normally on a 4100 with 466MHz cpu's- The
problem was with the required firmware upgrade!
LKH
|
500.4 | | HARMNY::CUMMINS | | Tue Feb 18 1997 18:45 | 16 |
| Did you read note 500.2 and try setting MEMORY_TEST=FULL?
The 466 MHz CPU is only supported by V4.8-5 console. This info was
buried in note 500.2 as well.
Finally, assuming Jim Hutmacher and I are correct that your problem is
that MEMORY_TEST is set other than FULL, this is an unfortunate set of
circumstances in that UNIX initially stated support for MEMORY_TEST set
other than FULL, but after we changed per their request, a latent UNIX
VM bug (presumably) was found and has yet to be fixed in this area.
Please reply with an indication of how the system's MEMORY_TEST EV has
been set.
Thanks,
BC
|
500.5 | Reduce guesswork | NETRIX::"[email protected]" | Dave Cherkus | Wed Feb 19 1997 10:05 | 35 |
| It could also have nothing to do with memory.
All the panic indicates is some kernel software tried to read memory
location 0x18, which should never happen because there is no memory
mapped at that address in kernel mode, ever, regardless of what
memory the console tested, etc.
Now that the system is up, presuming you used saved vmunix,
why don't you take the PC and RA values that are in the same set
of messages (i.e. the lines right after the one reporting the
access to location 18) and look them up with the debugger?
For example, suppose the message looks like:
trap: invalid memory read access from kernel mode
faulting virtual address: 0x0000000000000010
pc of faulting instruction: 0xfffffc00004fbdbc
ra contents at time of fault: 0xfffffc00004fbdb8
sp contents at time of fault: 0xffffffff8fceb958
You should be able to do:
# dbx vmunix
(dbx) 0xfffffc00004fbdbc/i
[rmerror_failover:1283, 0xfffffc00004fbdbc] ldq s3, 16(s3)
(dbx) 0xfffffc00004fbdb8/i
[rmerror_failover:1274, 0xfffffc00004fbdb8] stl s5, 76(sp)
In my case, this tells me that there is a problem in the rm failover
code. In your case it may tell you about some other problem, and
that problem could have nothing to do with the amount of memory
that the console tested, etc.
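If you have the panic text in a log, pulling out the addresses can be scripted. The following is a rough Python sketch, assuming the trap message has the same field labels as the example above; the names `parse_trap` and `dbx_commands` are hypothetical, and the script only prints the lines you would then type at the (dbx) prompt.

```python
import re

# Field labels copied from the example trap message above.
TRAP_FIELDS = {
    "va": r"faulting virtual address:\s*(0x[0-9a-fA-F]+)",
    "pc": r"pc of faulting instruction:\s*(0x[0-9a-fA-F]+)",
    "ra": r"ra contents at time of fault:\s*(0x[0-9a-fA-F]+)",
}

def parse_trap(text):
    """Return a dict of whichever trap fields are present in the text."""
    found = {}
    for name, pattern in TRAP_FIELDS.items():
        m = re.search(pattern, text)
        if m:
            found[name] = m.group(1)
    return found

def dbx_commands(text):
    """Build the 'address/i' lookups for the PC and RA values."""
    fields = parse_trap(text)
    return [f"{fields[key]}/i" for key in ("pc", "ra") if key in fields]

SAMPLE = """\
trap: invalid memory read access from kernel mode
    faulting virtual address: 0x0000000000000010
    pc of faulting instruction: 0xfffffc00004fbdbc
    ra contents at time of fault: 0xfffffc00004fbdb8
    sp contents at time of fault: 0xffffffff8fceb958
"""

if __name__ == "__main__":
    for command in dbx_commands(SAMPLE):
        print(command)
```

Run against the sample, it prints the same two lookups used in the dbx session above.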
[Posted by WWW Notes gateway]
|
500.6 | firmware upgrade was the problem, memtest=partial | TROOA::HANDY | L. Handy, MCSE DTN 626-3210 | Wed Feb 19 1997 13:07 | 21 |
| Thanks for the prompt replies.
The firmware upgrade procedure automatically changed
memtest to "partial", and according to .1 and .2, this is the
root cause of our system panic.
We did have memtest set to NONE before the upgrade, and I would have no
problem with setting it to FULL had it been documented. Better yet,
the firmware upgrade should set this as default.
Note booting the Unix CD, or replacing the 466's with 400's
yielded exactly the same results. This is how we isolated the problem to
the firmware upgrade. The downgrade of firmware and cpu's was an acceptable
workaround for us- no extra steps were required.
No doubt other customers will be impacted by this- I would strongly
recommend that we fix the firmware upgrade
procedure, and document this wonderful 'feature' and distribute it
via a BLITZ.
LKH
|
500.7 | | MAY30::CUMMINS | | Wed Feb 19 1997 15:57 | 74 |
| From: HARMNY::CUMMINS "Bill Cummins, PKO3-2/Q21, 223-4641" 19-FEB-1997 15:45:02.73
To: TROOA::HANDY
CC: GENT,SAVAGE,CUMMINS
Subj: Re: Please read replies #2 and #4 in ALPHASERVER_4100
Re: the problems you reported in ALPHASERVER_4100 note 500.*.
Hopefully you saw my reply note in 500.2 which describes how some of the
confusion over use of the MEMORY_TEST on Digital UNIX based systems came about.
Please note that our firmware readme files and release notes do document this
issue. See attached. We could only document what we knew at the time these
readme and release notes were issued, but if you look at them, you'll see we
*definitely do* tell the user that there are support issues. I believe I was
told by my Digital UNIX counterpart that issues with the AlphaServer 4000/4100
console's MEMORY_TEST EV are similarly documented and release noted, though I
have not verified this first-hand.
How did you attempt to update your firmware? Presumably using LFU? Via network?
Floppy? If network, by booting a single file via the console BOOT command or by
typing LFU at the console prompt and answering questions? Our readme file is
typically pretty hard to miss during most update procedures. That is, unless
the user opts to bypass it and/or ignore it.
I can sympathize with your frustration over this problem. Like you, we get
similarly frustrated by the apparent large number of users/customers who update
firmware without reading readme or release notes. But then, I don't read my VCR
manual when I buy a VCR either. Still, I wouldn't call and complain to my VCR
manufacturer if I did something to my VCR that I later found my manual told me
not to do.
If you have a suggestion for how we can better serve our customers in terms of
warning them about issues like this, I'd like to hear your ideas.
I've extracted bulleted items from our V3.0-10 and V4.8-5 LFU update readme
files. See attached readme file snippets.
Thanks and best regards,
BC
I extracted these bullets from the V3.0-10 LFU firmware readme file, under
changes since last release and issues heading, respectively..
--> Warn user when memory has gone untested due to MEMORY_TEST environment
variable not being set FULL. Print estimated test time for XSROM T24 if
console=serial and memory_test=full.
--> Mark untested memory pages (memory_test=partial/none) as untested
rather than good in bitmaps passed to Digital UNIX (and OpenVMS, even
though not yet supported). Note: see UNIX release notes for list of
UNIX versions and/or patches which support partial memory test.
I extracted this bulleted item from the V4.8-5 LFU firmware readme file..
--> If running OpenVMS or Digital UNIX, the MEMORY_TEST environment variable
must be set to FULL (default). PARTIAL and NONE are not supported.
From: TROOA::HANDY "Lyndon Handy- Americas Benchmark Program" 19-FEB-1997 12:29:05.45
To: HARMNY::CUMMINS
CC:
Subj: RE: Please read replies #2 and #4 in ALPHASERVER_4100
Bill:
No, we have not had the opportunity to try the 466's again. We lost the
opportunity to use them in a benchmark with customers present. The firmware
setting problem was very frustrating and lost us that opportunity. We had
to downgrade the firmware and re-install the original 400MHz cpu's. This does not reflect
well on the installation process. This 'feature' should have been documented
well, and issued in a blitz. I spoke to Ted yesterday about this.
We are benchmarking 'Digital' here, not just system performance.
Lyndon
|
500.8 | | MAY30::CUMMINS | | Wed Feb 19 1997 16:26 | 70 |
| We will be issuing a blitz on this matter. I have also entered note
#503 in this same notes conference which will hopefully serve as a
more visible alert of the problem.
Bottom line: if we had known about the UNIX problem with MEMORY_TEST
from the very start, we almost certainly would have implemented things
differently. The problem has been on again / off again support and
changes in behavior / implementation over time in this particular area.
We therefore tried to deal with the issue via FW readme files and FW
release notes. The bulk of all customers update via CD/ROM and the CD
automatically dumps the readme file to the screen prior to the update.
The user can disregard the readme, but what would you recommend we do
about this?
For the record, and re: reply .6
I'm sorry, but your statement about FW changing the MEMORY_TEST EV
settings is simply not correct. Firmware never modifies the setting of
the MEMORY_TEST EV unless told to do so by the user. It always uses the
value specified by the customer (via the SET command). The default, if
not modified by the user, is FULL. All 4100/4000 boxes are shipped from
Manufacturing with a setting of MEMORY_TEST=FULL.
The problem was that UNIX requested we change the behavior of console
as of the V3.0-10 release (in the area of partial memory testing). And
we did so. Both of us then tested the change, and found no problems on
the memory/system configs we tested.
We were unaware until after V3.0 console was released that all versions
of Digital UNIX have a bug that results in panics during booting
on 4100/4000 systems with certain memory configurations and partially
tested memory. All configs we (UNIX group and our FW qual team) tested
succeeded during our qual process. It was only relatively recently
discovered that UNIX had an issue/bug with settings other than FULL.
UNIX is working on a fix for this bug. Also, I just spoke via mail with
my UNIX counterpart and he agrees that we should issue a Blitz on the
matter. He expects the problem to be fixed in the next UNIX release.
Note that there have been notes in this notes conference that talked
about UNIX panics with MEMORY_TEST set other than FULL.
Also note that the V3.0-10 and V4.8-5 readme and release notes
absolutely *do* discuss issues with setting MEMORY_TEST other than FULL
on UNIX and OpenVMS systems (based on what was known at the time the
readmes/notes were written). The V3.0-10 readme refers to issues with
settings other than FULL in the context of UNIX and memory pages not
ever being tested. I.e. pre-V3.0 console behavior versus V3.0 behavior
and beyond. The V4.8-5 readme/notes discuss it in terms of telling the
user to not set it other than FULL because we knew about the UNIX bug
at the time we released V4.8-5. See the end of reply .7 for the text
from the readme file.
The FW upgrade does not set the MEMORY_TEST EV by default/automatically
or query the user for the appropriate setting because the console
doesn't know which operating system(s) the customer may want to boot.
Additionally, the LFU update utility is built into the console being
updated - not the console being updated to.. And, operating system
support for MEMORY_TEST less than FULL is planned to be phased in over
time. FW cannot check for operating system version at boot time, etc.
Sure, we could prompt the user during the update for EV settings, but
we felt that a note in the readme and release notes would be adequate.
And two orders of magnitude simpler to implement at that. As with UNIX
or VMS, there are multiple items that can be modified by the user which
may lead to improper system behavior, even crashes.
I admit that ideally the update utility would have built in warnings
and questions for all items that could be set by the user that might
negatively impact system behavior. This may be a direction we pursue
in the future.
|
500.9 | firmware was provided with cpu's on floppy | TROOA::HANDY | L. Handy, MCSE DTN 626-3210 | Wed Feb 19 1997 23:20 | 12 |
| The cpu's were shipped with the 4.8-5 firmware on floppy diskette.
According to our MCS technician, the documentation accompanying the
466 cpu boards indicated that required firmware was already pre-loaded.
However, after installing the cpu's, the tech. determined that this was
not the case, and followed the documented LFU-type installation
instructions. The most recent blitz document was also reviewed to
double-check.
I will inquire about the release notes, and whether or not they were
included with the firmware in hardcopy, or on the floppy diskette.
LKH
|
500.10 | Update on firmware kit on floppy | TROOA::HANDY | L. Handy, MCSE DTN 626-3210 | Thu Feb 20 1997 00:46 | 30 |
| Also, re: .8:
Thanks for all your support- here is an update.
1) I checked the package that was shipped with the 466 cpu's:
There is a cpu installation card, floppies (console and Openvms upgrade),
plus a "Dear Customer" cover memo, with attached
"Console/AlphaBIOS update instructions" in hardcopy. I verified that
the MCS tech. followed these instructions, using the LFU utility.
There is no mention of the MEMORY_TEST setting in hardcopy, however,
there is mention of it in the on-disk readme file:
"5. Firmware Anomalies, Restrictions, and Workarounds
--> If running OpenVMS or Digital UNIX, the MEMORY_TEST environment
variable must be set to FULL (default). PARTIAL and NONE are not
supported."
The 'panic' condition is not mentioned here.
2) Note that the MEMORY_TEST parameter was not modified using the
SET command at the 4100 console. After the firmware was upgraded to
v4.8-5, we reset the bootdef_dev and bootos_flags parameters
accordingly, attempted to boot unix, and
encountered the panic problem as indicated in .0.
Yes, I agree a BLITZ will help, and perhaps more explicit wording in
the readme file, ie. 'mandatory' setting of MEMORY_TEST parameter to
FULL.
|
500.11 | dir/title=trap | VIRGIN::GLAUS | Feel like Don Quijote... | Thu Feb 20 1997 03:29 | 5 |
| re .0
A dir/tit=trap etc. would have helped you faster (eg see note 403)
Guido.
|
500.12 | AlphaBIOS CMOS Set-up SAVE/EXIT changes MEMORY_TEST EV | HARMNY::CUMMINS | | Fri Feb 28 1997 18:31 | 70 |
| Lyndon,
The SRM console and AlphaBIOS more or less share the MEMORY_TEST
environment variable (stored in TOY NVRAM). Other notes in this
conference provide more detail. What's important here, is that we
may have discovered why your system seemed to have the MEMORY_TEST
EV setting change out from under you..
See attachment. Will post more details as they become known..
BC
From: HARMNY::CUMMINS "Bill Cummins, PKO3-2/Q21, 223-4641" 28-FEB-1997 18:18:51.91
To: SALEM::HOBBS,SALEM::MCGAR,FATSYS::SIMON,AWAKEN::EWHITE,ZGOST2::SHTAN
CC: CUMMINS
Subj: MEMORY_TEST environment variable and AlphaBIOS CMOS set-up menu SAVE/EXIT
AlphaBIOS will change the MEMORY_TEST EV setting (shared between SRM and
AlphaBIOS consoles) when one saves/exits from the AlphaBIOS CMOS set-up,
even when the memory test enabled/disabled and partial/full menu items are
not touched or altered in any way!
Could you check your process to make sure that MEMORY_TEST=FULL is checked
*after* any CMOS or Advanced CMOS set-ups might be done during the Stage
II config/test process? I'm guessing all systems are leaving MFG with
MEMORY_TEST=FULL and that customers are changing the variable either
purposefully or inadvertently (AlphaBIOS "feature"), but thought I'd check
with you just in case..
Let me know.
Thanks,
BC
From: HARMNY::CUMMINS "Bill Cummins, PKO3-2/Q21, 223-4641" 28-FEB-1997 18:08:47.03
To: OLEUM::BUCHMAN
CC: GENT,CSC32::HUTMACHER,CUMMINS
Subj: MEMORY_TEST environment variable and AlphaBIOS CMOS set-up menu SAVE/EXIT
Hi Matt,
We've been seeing a lot of problems in the Field on UNIX and VMS
machines relative to the SRM console's MEMORY_TEST EV. UNIX believes it
has a VM bug which causes UNIX to panic if MEMORY_TEST is not set FULL
and console is V3.0-10 or greater (on certain memory configs). VMS has
never supported a PARTIAL/NONE setting.
Until just recently, we could not figure out why so many customers were
being impacted. Manufacturing has guaranteed me that they set the SRM
MEMORY_TEST EV to FULL by default. We recently discovered that AlphaBIOS
will reset the shared MEMORY_TEST EV NVRAM location after SAVING/EXITING
from the CMOS set-up menu. I've seen a SAVE/EXIT following an unrelated
CMOS parameter change result in MEMORY_TEST being set NONE and I've also
seen it result in a PARTIAL setting. And I wasn't even in the Advanced
CMOS Set-up menu!
There are two reasons for my sending you this mail:
1) Can you write up exactly how AlphaBIOS works w.r.t. this EV? If
different versions of AlphaBIOS behave differently here, then I'd
also like to have that information.
2) I strongly request/encourage you to make the AlphaBIOS default
memory test settings ENABLED and FULL in the next release. Would
you be willing to do so?
A quick reply with the info we need would be greatly appreciated as I'd
like to get an updated Blitz to the Field as quickly as possible.
Thanks,
BC
|