Conference kernel::csguk_systems

Title: CSGUK_SYSTEMS
Notice: No restrictions on keyword creation
Moderator: KERNEL::ADAMS
Created: Wed Mar 01 1989
Last Modified: Thu Nov 28 1996
Last Successful Update: Fri Jun 06 1997
Number of topics: 242
Total number of notes: 1855

193.0. "ALPHA Bugchecks ??" by KERNEL::ADAMS (Brian Adams CSC-Viables '833-3026) Sun Nov 06 1994 14:01

    
    In a recent mail to Dave Gledhill, I mentioned that there was a
    pressing need within the Systems Group to ramp up on Alpha
    bugcheck analysis skills.
    
    I suggested that, as we have some bugchecks on tape, it might be an
    idea for him to go through them with one or two members of the group
    at a time, so we can see what's what and build some skills.
    
    His response is the next reply. Please feel free to add your comments
    and suggestions.
    
    If Norman Bland wants to take this to the new notes file, I have no
    objections.
    
T.R  Title  User  Personal Name  Date  Lines
193.1. "DG He say this !!" by KERNEL::ADAMS (Brian Adams CSC-Viables '833-3026) Sun Nov 06 1994 14:04  62 lines
	About the training, I intended to do a few more talks, the most
important one being the calling standard, as that is the thing that has changed
the most. The plan was for me to do this, IO internals and maybe some others,
and also for Ian Megarity to do one on memory management and maybe some others.
(It would have taken a while for both of us to prepare these
ones, as they are really complicated on AXP.)

	However, when Steve left suddenly, just when I was starting teaching
this stuff, Brian said that we would have to can it for a few months due to not
enough bums on seats! I doubt if anything has changed, plus me and Ian have
been quite busy recently, as has everyone else.
	

	What I propose as a workaround is to create a notes file. When we close
an AXP call, try to write it up with a bit of detail of what's going on, and
put the relevant parts on the call. Folks can ask questions and add comments
(as replies to the note) and hopefully explain things if they are not clear.
Also, if you guys can archive the directory contents (the dump, plus any
temporary files for the call) onto TA90 (use the note number as the label),
then if you want to check the dump at a later stage or use it as material for
a talk, it will be easy to get it back.

	You guys let me know if this is a good idea; if so, put this message
in the notes file to see what the others think. I will also copy Ian, Geoff
and Brian to see if they want to put their calls on this notes file.

	(The easiest way to do this would be to have a queue, say sys_archive.
Any calls worthy of this treatment can go into that queue; if anyone gets time
they can archive the directory contents, delete them and return the call
to whoever worked on it. Better this way, as it makes it easier for ccd should
the call be re-opened.)
	

	About Ruth Goldenberg, do you mean the internals book or bugcheck.mem?

	I think there is an AXP version of bugcheck.mem that you guys should
have in the library; if you haven't got it let me know and I will print a copy off.

	Note: the stack patterns for the bugchecks look quite different to VAX,
but that is due more to the change in calling standard than anything else.

	If you mean the internals book, there are AXP internals books around. I am
sure you guys have some; if not, I know Ian and maybe Geoff have copies (I don't).
Most of the time the VAX internals are near enough, and you can find out the gory
details/differences from the sources. If you haven't got a hard copy let me
know - I can print some off from some PostScript files.

	(Steve copied some over a while back, but better to check with me
first, as I will check to see if there are any later versions in the States.)

	Note that in many areas the VAX internals principles are much the
same; the details are different due to the difference in architecture, calling
standard, size of data items, naming conventions etc. This is why I started off
teaching the architecture and was going to do the calling standard. Once you
know that stuff everything slots into place a bit better.

	If we have some time when I can do some more talks let me know, I 
will need some time to prepare though.

	dg.

193.2. "-yes" by KERNEL::ANTHONY Mon Nov 07 1994 23:09  29 lines
    
    	this is interesting!!
    
    	I sat in on a couple of Dave's talks in the summer (seems like
    	a long time ago!) and honestly remember very little...
    	
    	I think the way to learn this stuff is by example.. It's so 
    	much easier to have something to refer back to.. go through
    	a worked example and then apply the knowledge to the problem at 
    	hand.
    
    	We probably need no more than 2-3 bugcheck examples to start
    	with, say one each of INVEXCEPTN, SSRVEXCEPT and PGFIPLHI.
    	
    	What we need is CLEAR, CONCISE analysis, not reams of
    	data structures that mean nothing unless you are a guru
    	yourself.  Something like the more detailed Stars stuff, but
    	with more explanation and step-by-step detail.
    
    	I still think Dave needs to go ahead and give the talks, but
    	base the talks on the articles and if necessary re-visit them
    	at a deeper level another time.
    
    	This will need a LOT of work to set up.. DG are you volunteering?
    	:-)
    	
    	we can use the systems_tech notes file for this?
    
    	Brian

193.3. "DG he speaketh again" by COMICS::GLEDHILL Thu Nov 10 1994 19:26  24 lines
I think I must have used reams of datastructures in .1 as I don't think it
was clear to anyone but me.

As I understood the situation, we didn't have time to do any more talks,
whether they were on internals details OR going thru dump files (when I did
the talks before it was to only 2 or 3 folks at a time).
(Well, it was you, Brian, who told me to can the talks until it got less busy.)

So I was suggesting that I (and everyone else closing calls) write up the alpha
crashes we close and put them in the notes file and people can read them at
a spare moment or at home etc, and ask questions (via replies to the notes).

If we ever get any time then we can go and do talks on these crashes or 
internals or whatever. 

I am quite happy to talk as much as you like, but have to balance that against
less time to take calls. As you will have noticed, recently I have been taking
more calls from VMS and also taking some holiday over the last few ages (so you
guys probably wouldn't notice).

So what are you saying, is the bum/seat ratio such that we can do talks again?


dg.

193.4. "A plea for some practical instruction" by KERNEL::BLAND (Norman Bland 833 3797 CSC, Basingstoke) Sat Nov 12 1994 10:47  51 lines
    
    OK, Norman B is joining the debate.
    
    From my perspective, there are some issues which need to be understood.
    
    o - We have varying skill levels in analysing VAX system bugchecks
        (forget Alpha systems for the moment). There is a large gulf
        between the 'very best' and the 'not so good'. Although we need some
        focus on analysing Alpha system bugchecks, I do not think we should
        abandon assisting those who are in the 'not so good' category (that
        includes me), in improving their skills on Vax system bugchecks.
    
    o - Some time ago, when I and others in the old RDC group were improving
        our bugcheck analysis skills, we were placed into the FAST TRACK
        group; bugcheck analysis was done (in the main) by the other group
        called BUGCHECK ANALYSIS & 9000 (or something similar). Skills were
        not totally lost, but it did not help when some of the ex FAST TRACK
        people were placed into what is now the SYSTEMS group.
    
    o - Even if we have sufficient people in the group now (bums on seats),
        attending seminars which theorise about OpenVMS AXP is of little
        use. Whilst some theory is necessary, what I need (and I believe
        some other Systems engineers need) is a practical approach to analysing
        bugchecks. When I am analysing an instruction stream within an
        AXP bugcheck, when I am considering what data structure I should be
        looking at, when I am considering what that structure contains, I
        need to do it when actually viewing a 'real' bugcheck, with someone
        beside me with the appropriate skills explaining what an
        instruction is doing, what commands are available for analysis, the
        data structures involved, and why he is using a particular
        troubleshooting technique.
    
    o - Whilst the idea of saving bugchecks for later study appears to be
        a good idea for improving bugcheck analysis skills, I cannot for
        the life of me see how we will get the time to do this without
        having a serious impact on the manning levels within the group.
    
    o - The idea of writing up details of bugchecks that have been analysed
        and placing them in a notesfile would, I believe, be useful. This
        would enable us to ask questions, with replies as to 'how' and 'why'
        etc.
    
        OK, I am not sure how we do it, but I have tried to summarise what
        I require to improve my bugcheck skill level. This requires some
        theory and a lot of practical work. Different individuals within
        the group will have different needs. I need a skilled person to
        'teach' me, not to blind me with science. Whilst I may never be as
        good as the best within our group, I know that I could be
        considerably better at bugcheck analysis given the appropriate
        knowledge.
    
193.5. by COMICS::GLEDHILL Sun Nov 13 1994 10:12  114 lines
I had a similar conversation with Brian yesterday...

I am probably out of step here, depending on what you mean by analyzing
bugchecks (I guess), but in my opinion there isn't a way to 'analyze bugchecks'.
A bugcheck is just a dump of the system at the time. If you understand how
the system is supposed to work then you look at the dump and see that it
is not doing what you expect etc...

I have never been on a bugcheck analysis course, never sat in with anyone who
analyses them for a living. Just make it up as I go on... (bit like sex,
you can read all the books, but at the end of the day you just get on with it,
use your imagination (and any handy tools lying about)). I think once you try
and pin it down with techniques you lose it.

What I am getting at is that if I could say when you get a particular crash
then you do this, that, then the other, it would be easy. Might as well disband
the group and write some software to do it. In my opinion the theory is the
most essential part of it, if you are wanting to get past the diagnosing stage
of the job. Me and probably Im, Gj etc learned that by sitting at home looking
thru the internals books, source code etc as much as anything else.

You need to understand how the system as a whole fits together. When you
look at a crash you see that the system was in a particular place. You ask
yourself questions: how did it get there, should it be there, what else
was going on at the time, what went on before... It's no different to any
other form of trouble-shooting.

As far as finding your way through the mechanics of a dump (ie stacks etc),
the most important things to do are know the architecture and the calling
standard, as I already said above. Without that you don't really have a chance
to work out what's going on, esp on AXP. (As a good example of that, one of the
early bugs in AXP VMS was when DECnet used R31 to store some stuff in...)
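
For anyone wondering why the R31 example matters: on Alpha, R31 always reads as
zero and anything written to it is thrown away, so a value stashed there is
simply lost. A minimal Python sketch of that behaviour follows. It is a toy
model only; the class and method names are made up for illustration and this
is not VMS or DECnet code.

```python
class AlphaIntRegs:
    """Toy model of the 32 Alpha integer registers (R0..R31)."""

    ZERO_REG = 31                 # R31 is the hardwired-zero register
    MASK64 = (1 << 64) - 1        # registers are 64 bits wide

    def __init__(self):
        self._regs = [0] * 32

    def write(self, n, value):
        if n == self.ZERO_REG:
            return                # writes to R31 are discarded
        self._regs[n] = value & self.MASK64

    def read(self, n):
        if n == self.ZERO_REG:
            return 0              # R31 always reads as zero
        return self._regs[n]


regs = AlphaIntRegs()
regs.write(5, 0x1234)
regs.write(31, 0x1234)            # this value is silently lost
print(hex(regs.read(5)), regs.read(31))
```

Code that is correct on a machine where every register holds what you put in
it goes wrong here without any trap, which is why knowing the architecture
comes before reading the stack.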

Once you work out where you are you can look at the sources and work out what's
supposed to be happening, then you compare that with the dump and see what
is really happening. Once you find the discrepancy you're on your way.
Trouble is, if you don't understand how VMS is supposed to work, often the sources
don't make a lot of sense. That was the good thing about the old VMS TCD for
level 7, it made sure that you understood a lot of that stuff like (eg)

io-database
system services, how they are dispatched.
synchronization (mutexes, spinlocks etc).
memory-management (ptes, pfns, pfls etc).
Interrupt, ast dispatching

and so on.

some more comments...

    o - Even if we have sufficient people in the group now (bums on seats),
        attending seminars which theorise about OpenVMS AXP is of little
        use. Whilst some theory is necessary, what I need (and I believe
        some other Systems engineers need) is a practical approach to analysing
        bugchecks. When I am analysing an instruction stream within an
        AXP bugcheck, when I am considering what data structure I should be
        looking at, when I am considering what that structure contains, I
        need to do it when actually viewing a 'real' bugcheck, with someone
        beside me with the appropriate skills explaining what an
        instruction is doing, what commands are available for analysis, the
        data structures involved, and why he is using a particular
        troubleshooting technique.
  

This may help, but most bugchecks are different, the danger with this is relying
on copying what someone else did last time and trying to apply it on the next
one.  

A lot of analysis is not a 1 stop process. Most calls I look at for a while,
get an idea what's going on, read a couple of things, ask a chap, go onto something
else or down the pub. Come back to it later. I find a lot of stuff I work out
when I am not thinking about it. If you try too hard it makes it hard. This is
why I often say to you chaps I will have a look later. If it was going through
calls already closed, though, that wouldn't be a problem.

A lot of the above you can work out for yourself. Should be able to work out
what the instruction does from the book and what I tried to explain in the
talk. When you find the stuff in the sources, that should tell you what
data-structures are supposed to be around; you can have a look at them and
see what's going on. However, you will need the background theory to understand
where that bit of code/structure fits in, what it does etc.

The commands for analysis are mostly listed under help (there are a couple
undocumented). Don't forget there are DCL utilities that you can use to save
time (sort, diff, search for example).
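
As one illustration of letting a tool do the comparing for you: the sketch
below uses Python's standard difflib module rather than the DCL utilities
named above, so treat it as the same idea in a different tool. The function
name changed_lines is hypothetical, and the register lines are made-up sample
text, not real dump output.

```python
import difflib


def changed_lines(before_text, after_text):
    """Return only the removed ('-') and added ('+') lines between two captures."""
    diff = difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
        n=0,                      # no context lines, just the changes
    )
    body = list(diff)[2:]         # skip the '--- before' / '+++ after' header pair
    return [line for line in body if line.startswith(("+", "-"))]


# Made-up sample captures, e.g. two saved listings of the same display.
before = "R0 = 1\nR1 = 2\nR2 = 3"
after = "R0 = 1\nR1 = 7\nR2 = 3"

for line in changed_lines(before, after):
    print(line)
```

Saving two displays to files and comparing them mechanically is quicker and
less error-prone than reading both by eye, which is the point of reaching for
the utilities in the first place.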

Don't get me wrong, I am not saying practical demonstration would not help.
But if you rely on that alone without using your understanding and imagination 
as well that demonstration will probably only be of use on how to solve THAT 
bugcheck.
  
    o - Whilst the idea of saving bugchecks for later study appears to be
        a good idea for improving bugcheck analysis skills, I cannot for
        the life of me see how we will get the time to do this without
        having a serious impact on the manning levels within the group.

Do you mean the time archiving them or looking at them later?? Maybe could just
keep the customer tape as I suggested in an earlier note. Or automate using
SLS or something.
    
    o - The idea of writing up details of bugchecks that have been analysed
        and placing them in a notesfile would, I believe, be useful. This
        would enable us to ask questions, with replies as to 'how' and 'why'
        etc.
    
Not only useful for the people reading it; if you write it up as you go along
I often find it helps the analysis. Once you try explaining it in words you can
notice mistakes/false assumptions and so on. I think we should all do this
whether we put the call in the notesfile at the end or not. (Don't think all
calls need to, just those that seem to have educational value.)

I could go on, but think I better go do some work...

dg.

193.6. "TCD would help - practical+theory would help" by KERNEL::BLAND (Norman Bland 833 3797 CSC, Basingstoke) Sun Nov 13 1994 14:25  63 lines
> depending on what you mean by analyzing bugchecks

I guess the real problem is not understanding the VMS internals well
enough.

> As far as finding your way through the mechanics of a dump (ie stacks etc) 
> the most important things to do are know the architecture and the calling
> standard as I already said above.

Don't understand them well enough. Sitting in an office and going through the
theory without having an example/examples to work through, is of little use
to me.

> Trouble is if you don't understand how vms is supposed to work often the sources
> don't make a lot of sense. That was the good thing about the old vms tcd for
> level 7, it made sure that you understood a lot of that stuff like (eg)

Absolutely. I have attempted to restart TCD with Brian Lindley but lately the
number of people on shift and the number of calls in our queue have made this
unrealistic. My main aim was to do VMS, starting from a level and working up.
The hope was that the learning would help with bugcheck analysis and for other
software related calls.

> A lot of the above you can work out for yourself. Should be able to work out
> what the instruction does from the book and what I tried to explain in the 
> talk. 

I do try but often get stuck; if I didn't, I would not be writing this note.

> When you find the stuff in the sources that should tell you what 
> data-structures are supposed to be around, you can have a look at them and
> see whats going on.

If you understand the internals well enough.

> However you will need the background theory to understand
> where that bit of code/structure fits in/does etc.

Yes. But please please let us do this in the context of looking either at
'real' bugchecks or at a 'live' system.

> Don't get me wrong, I am not saying practical demonstration would not help.

So why can't we combine some theory with practical?

> Do you mean the time archiving them or looking at them later??

NO. What I meant was the time to review the cases (bugchecks) with someone who
has a good understanding, in order to learn something.
    
> Not only useful for the people reading it, if you write it up as you go along

This will have to be a new discipline; namely ensuring that relevant information
from the dump is saved and placed into a notesfile.

> I often find it helps the analysis. Once you try explaining it in words you can
> notice mistakes/false assumptions and so on. I think we should all do this

I see this as being something that could be VERY useful.

Norman B

193.7. by COMICS::GLEDHILL Sun Nov 13 1994 15:38  31 lines
I just had a chat with Norman about this on the phone, but he had to go off
and take a call!

I think that proves the point about lack of time to do this sort of stuff.

I think we will have to make time for me to finish what I started in the
summer. We will have to check with Paul first and then book some time.

It was my intention shortly after going thru the instructions to do a talk on 
the calling standard, which would have included REAL stacks - but got canned.
The idea being after the theory to go through a stack printout and work out what
everything was on there. I think this is probably the sort of thing that you 
want.

This should help in diagnosing (ie working out how we got to where we are).

To get any further, need to do some internals stuff. As I said to Norm on
the phone, I don't know what the prerequisites are for the internals course, but
what I did was read the architecture and system programming documentation.
This gave me a good enough overview to get something out of the course. (I did
read the first few chapters of the internals book first as well, on the advice
of the great Ian Megarity.)

As we also discussed on the phone, don't think there are any short cuts; either
we or the company (or both) have got to make time for this.

What about the notes file? Do you want to set one up? Shall I do it, or is
someone else in charge of this sort of thing??

dg.

193.8. "how about.." by KERNEL::ANTHONY Mon Nov 14 1994 19:52  37 lines
    
    	Ok, how about we start a three-pronged attack on this?
    
    1	DG sets up an entry in this notesfile for discussion of
    	one crashdump that we have on the system. (AXP dump)
    	Dave please choose one which is the most appropriate.
    	Give a pointer to the dump. AND GIVE US ALL A WEEK TO HAVE
    	A LOOK AT IT!! (no cheating, don't look at the call update!!)
    
    	We can add replies to start the analysis. If we
    	are way off, DG can give hints.  We should all end up understanding
    	the thought processes needed to go through that PARTICULAR dump.
    	You should not be afraid of replying and looking a fool if you are
    	wrong.. this is a learning exercise!!
    
    	If worthwhile we start on another dump.. over a period of time, we
    	will build our expertise, and have a record of analysis to refer
    	to when we are stuck on a 'real' dump.
    
    2	Dave: create ANOTHER entry for this notesfile that will be used as a
    	learning tool to understand the AXP calling standard.  I see it 
    	firstly as a write up on the standard by DG, followed by Q&A's
    
    	If this is successful, DG chooses another topic and we go through
    	the same process..
                                  
    3	THEN after Christmas, when we have reasonable manning, we schedule
    	DG to run seminars on the calling standard etc.  We will have
    	examples (here) to refer to and have sufficient understanding
    	before the seminar such that the info DG gives is much more
    	meaningful..
    
    	What does the team think?
    
    	we could start this NOW!!!
    
    	Brian

193.9. "OpenVMS VAX/AXP Internals and Data Structures TBI" by KERNEL::BLAND (Norman Bland 833 3797 CSC, Basingstoke) Tue Nov 15 1994 15:03  11 lines
    
    I have not had time to investigate this yet but the following TBIs are
    available in TIMATOOLS. I have my fingers crossed.
    
    Norman
    
    6-TBI EY-Q157E-L0-0001 OpenVMS AXP Internals and Data Structures I
    6-TBI EY-Q158E-L0-0001 OpenVMS AXP Internals and Data Structures II
    
    16-TBI EY-Q159E-L0-0001 OpenVMS VAX Internals and Data Structures I
    16-TBI EY-Q160E-L0-0001 OpenVMS VAX Internals and Data Structures II 

193.10. "Looks worth a shot." by KERNEL::ADAMS (Brian Adams CSC-Viables '833-3026) Thu Nov 17 1994 21:09  9 lines
    
    I've had a look at the Alpha versions of these and they look
    VERY useful. So much so that I'm going to take a copy of the two
    student guides and listings, and work my way through these in my
    own time.
    
    Don't know how long it will take, but with some practical in the
    office, to back it up, it might be as good as a course !!
     
193.11. "good!!" by KERNEL::ANTHONY Thu Nov 17 1994 22:16  6 lines
    
    please make a copy available for our library
    
    	cheers
    
    		Brian