T.R | Title | User | Personal Name | Date | Lines |
60.1 | | STAR::KLEINSORGE | shockwave rider | Sun Jan 29 1989 14:50 | 10 |
|
Hmmm. Which workstation has a CI? The only one I know of that
"could" use SCS effectively would be the VS8000, but it doesn't
support a CI adapter for its BI.
A more interesting idea might be using LAT as the transport; it's
simple, small, and fast.
|
60.2 | This window manager is confused. | ANTPOL::PRUSS | Dr. Velocity | Sun Jan 29 1989 20:05 | 6 |
| I thought we used SCS on the Ethernet in an NI/MI Vc. There aren't
enough slots in a VS8000 for a CI, but that would be an interesting
tangent!
-fjp
|
60.3 | Not really!! | SKRAM::SCHELL | Working it out... | Sun Jan 29 1989 22:03 | 16 |
| >
> Hmmm. Which workstation has a CI? The only one I know of that
> "could" use SCS effectively would be the VS8000, but it doesn't
> support a CI adapter for its BI.
>
> A more interesting idea might be using LAT as the transport; it's
> simple, small, and fast.
Whoa!! SCS is not a CI-only protocol. SCS runs on LAVCs, using
the Ethernet as a transport.
I think the real question is whether SCS is a better protocol than
DECnet task-to-task???
Mark
|
60.4 | Forgive me, but I just finished reading all the TCP stuff... | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Mon Jan 30 1989 17:38 | 5 |
| Oh great...you want us to "support" this too, or shall we just ship the image
for everyone to play with?
Burns
|
60.5 | | MAXWIT::PRUSS | Dr. Velocity | Mon Jan 30 1989 18:34 | 12 |
| What, you mean you have it working already and are holding out on
us?! :-)
Just an idle question for speculation, really. But I kind of like
the idea of sending stuff to a VAXstation 8000 from the
VAX_THAT_IS_YET_TO_COME over the CI. We know that SCS is much more
efficient than DECnet FAL for file transfer. I have no idea how it
would compare to task-to-task for the X protocol.
-fjp
|
60.6 | | STAR::KLEINSORGE | shockwave rider | Tue Jan 31 1989 00:03 | 11 |
|
My wife did a prototype of the "LAST" (a LAT derivative) driver that
talked directly to the CI. Don't remember the numbers offhand, it was
*very*, *very* fast.
And raw LAT is probably about as quick as you want over the ethernet
(though DECnet isn't a slouch on the ethernet for just raw data
communication according to her tests).
|
60.7 | Remember DECnet over CI? | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Tue Jan 31 1989 09:08 | 10 |
| Well, I can't say I know much about this, but remember when CI first appeared
a few years ago? It was possible to run a DECnet circuit over the CI. After
a while, though, it was determined that it was much more efficient to run DECnet
over the Ether and SCS over the CI. (Now of course we also have SCS running
over the Ether as well.) This may prove nothing, except to show that there is
precedent for deciding that it was better to use the Ether than the CI for one
particular class of communication protocol.
Burns
|
60.8 | | LESLIE::LESLIE | Andy Leslie, CSSE / VMS | Tue Jan 31 1989 13:00 | 3 |
| The reason it was inefficient and thus slow was that it still used
DECnet! Using native protocols would be much faster - and is!
|
60.9 | | STAR::SNAMAN | Sandy Snaman, VMS Development | Wed Feb 01 1989 11:19 | 12 |
| Re .7:
Regarding the old wisdom of running DECnet over the Ethernet rather than
on the CI: some recent performance testing has shown that this has
been a myth for some time.
The advent of processors faster than a 780 made it possible to do
substantially better using DECnet on the CI than on the Ethernet.
|
60.10 | | KONING::KONING | NI1D @FN42eq | Wed Feb 01 1989 18:12 | 13 |
| I think the reason there is an NISCS isn't because it's faster (inherently)
than DECnet, but because it was the way the VAXclusters software could be
made to run on an NI. So it's questionable whether that would be any
better than DECnet to talk X to workstations.
As for LAT, remember that LAT is a request-response asymmetric protocol
optimized for the character-at-a-time interactive exchanges of dumb terminals.
X uses a very different sort of data flow pattern (pipelined rather than
request-response) and is unlikely to run as well, let alone better, on LAT
than on DECnet.
paul
|
60.11 | I doubt that an SCS transport on the Ethernet would be much different than DECnet | STAR::BECK | Paul Beck | Wed Feb 01 1989 20:28 | 9 |
| Paul K is correct in .10 as to the rationale for NISCS. There is relatively
little difference between well-optimized DECnet performance on the Ethernet
and equivalent performance using NISCS. The evidence for this is in the
performance figures of DFS, which uses DECnet, but which comes quite close
to LAVc performance on Ethernet. The numbers aren't identical, but then they're
not doing exactly the same things once they get off the wire. (Comparing LAVc
with DAP will not produce a favorable comparison for DAP, on the other hand.)
|
60.12 | Then why is so much effort being put into a DECwindows transport using LAT? | IO::MCCARTNEY | James T. McCartney III - DTN 381-2244 ZK02-2/N24 | Fri Feb 03 1989 16:28 | 7 |
|
An unannounced product to come out of DSG is planning to use LAT to transport
the X-wire protocol. If this is not such a good idea, then what needs to be
done to get them to change their implementation strategy?
James
|
60.13 | ***sigh*** | KONING::KONING | NI1D @FN42eq | Fri Feb 03 1989 17:03 | 5 |
| We who have been trying to change that approach have been wondering about
that as well. So far nothing has worked.
paul
|
60.14 | | RAMBLR::MORONEY | Better to burn out than it is to rust... | Fri Feb 03 1989 22:01 | 11 |
| I would suggest using a separate Ethernet protocol for DECwindows transport,
rather than trying to lay it on LAT or SCS. This way the driver code, packet
formats, etc. can be optimized for the type of traffic expected. I'd guess
that 'Windows on Ethernet SCS would probably do OK, but on LAT would be poor
since, as mentioned, LAT is optimized more for single characters.
'Windows seems to be a big enough part of DEC's future that it should deserve
its own Ethernet protocol.
-Mike
|
60.15 | | MIPSBX::thomas | The Code Warrior | Sat Feb 04 1989 00:24 | 9 |
| A good implementation of NSP serves quite nicely as a transport for the X
protocol. Since the X protocol consistently generates bidirectional traffic,
all data ACKs tend to be piggybacked. Thus almost all the traffic tends
to be X packets with very little overhead.
Note: VMS DECwindows users may want to raise their workstations' pipeline
quota to 8K or more to allow DECnet-VMS to request delayed ACKs more frequently.
|
60.16 | maybe already being done | ATLAST::BOUKNIGHT | W. Jack Bouknight | Sat Feb 04 1989 17:34 | 6 |
| re: .15, VMS DECwindows startup already checks for and SETs DECnet
EXEC PIPELINE QUOTA to 10000. I assume that's the parameter you
were recommending be changed.
Jack
|
60.17 | | KONING::KONING | NI1D @FN42eq | Mon Feb 06 1989 12:27 | 13 |
| Re .14: just because there is a big market for something doesn't mean that
it should have a protocol of its own. In fact, just the opposite is true:
by using standard protocols, you make the product even more attractive.
That's doubly true since, as was mentioned, X runs well over DECnet and
there is no reason to believe that it will run substantially better over
any other transport. Besides, developing additional transports is
expensive, counterstrategic, etc. It prevents things from running over
wide area networks, gives a "we don't care about standards" message, and
so on.
paul
|
60.18 | | VISA::BIJAOUI | Tomorrow Never Knows | Tue Feb 07 1989 02:42 | 20 |
| >That's doubly true since, as was mentioned, X runs well over DECnet and
>there is no reason to believe that it will run substantially better over
>any other transport. Besides, developing additional transports is
I'm feeling a bit doubtful about this statement. So far, we have had a
number of problems using X over DECnet, links being lost and so on,
and I'm sure a LAT (Ethernet, whatever you want) based transport would
be an excellent solution (especially for LAVCs).
Internally, we are moving towards hidden areas to be able to connect
our VAXstations to the network.
DECnet phase V is too far away, and I believe we really need a LAT
(Ethernet, whatever you want) transport. At least, something that
doesn't get stuck in the bottleneck that an L2 router can be in such a case.
Anyway, have you ever gathered statistics (e.g. packets/sec) on DECnet
usage when using X across it?
Pierre.
|
60.19 | ??? | PSW::WINALSKI | Paul S. Winalski | Tue Feb 07 1989 14:13 | 26 |
| RE: .-1
> DECnet phase V is too far away, and I believe we really need a LAT
> (Ethernet, whatever you want) transport. At least, something that
> doesn't get stuck in the bottleneck that an L2 router can be in such a case.
I don't understand this. If your VAXstation is plugged into an Ethernet,
then DECnet runs over it. Stations on the same Ethernet can talk directly
to each other without involving any routing node whatsoever, let alone an L2
router, if the stations are in the same area.
> I'm feeling a bit doubtful about this statement. So far, we have had a
> number of problems using X over DECnet, links being lost and so on,
> and I'm sure a LAT (Ethernet, whatever you want) based transport would
> be an excellent solution (especially for LAVCs).
If you are talking about LAVc, then all of the nodes MUST be in the same
DECnet area, and they all must be on an Ethernet. DECnet works just fine
without involving any routing nodes in these circumstances, and it uses
the Ethernet. Assuming that your Ethernet hardware is configured properly, the
only case where you should be seeing logical links broken is when one or
the other machine goes down, and there is no preventing that. I don't
understand your problem here.
--PSW
|
60.20 | | STAR::KLEINSORGE | Toys 'R' Us | Tue Feb 07 1989 14:48 | 21 |
|
Paul, it's a common perception that LAT often works better than
DECnet on the Ethernet, especially if you've ever been on one
of the segments in ZK. I often have two machines next to each
other that refuse to see each other, and CTERM over a couple of
bridges in this building can be a hazard. On the other hand, I
quit using SET HOST a long time ago because LAT proved so much
more reliable (hence the perception), and often I get pissed
when a COPY tells me that my node isn't reachable when I'm
VWSLATed from my node at the time I get the message.
It may be that the difference is that DECnet is picky about
making sure that the data actually gets there and gets there
correctly, while LAT assumes everything is peachy and has much
less error checking and looser "tolerances" (an error!? hey, let's
send it again...).
Anyway, to a typical "user", LAT usually looks more reliable.
|
60.21 | !!! | VISA::BIJAOUI | Tomorrow Never Knows | Tue Feb 07 1989 14:59 | 68 |
| Re: .19
>I don't understand this. If your VAXstation is plugged into an Ethernet,
>...
No. Not if the VAXstation is in a hidden area (which is, in our case,
area 63). In area 51 (the regular one), we have an L2 router which
talks to another L2 router which stands in area 63. The path for a
packet from a satellite to the boot node (which stands in area 51,
because we need access to the WAN) is then through the two L2 routers.
I believe you can get more info in the notesfile IAMOK::HIDDEN_AREAS.
>If you are talking about LAVc, then all of the nodes MUST be in the same
>DECnet area, and they all must be on an Ethernet. DECnet works just fine
No, the nodes aren't in the same area, but they are on the same LAN.
Although they are on the same LAN, they have to go through the two L2
routers for *DECnet* communications (but not for LAT or SCS
communication).
>the Ethernet. Assuming that your Ethernet hardware is configured properly, the
>only case where you should be seeing logical links broken is when one or
>the other machine goes down, and there is no preventing that. I don't
No, we have had cases where links were lost without one node or
the other being down. It's just that the DECwindows server can't
cope with the buffers (as I understood it).
Note #293.0 in the notesfile HANNAH::DECW$DISK:[PUBLIC]DECTERM describes
the problem in more detail. I quote without permission some of the content of the
note. You can go to the notesfile to get the exact context, for better
accuracy.
> Occasionally when the server has replies and events to write to a client
> and network output buffers are unavailable to perform the write operation,
> the current server would attempt the same write for a number of times
> prior to disconnecting the non-responsive client.
> In the duration of the retries, the server would not serve any other
> client, and to the user, it would appear that the server is hung.
As you can see the server will hang, but sometimes, I believe when
time-out occurs, the server just gives up and drops everything on the
floor.
As a fix for the moment, we raised the Maximum buffer parameter in the
boot node exec (from 100 to 200) and the pipeline quota. And wait and
see.
In our area (51), we should run out of numbers in a couple of months.
What will happen to the dozen VAXstations I have ordered? Run them
standalone, off the network? Naah, everybody needs the net, so we
just have to squeeze our elbows, waiting for DECnet phase V, which should
(as I understood it) solve the limitation of 64 areas and 1023 nodes
per area, and use the concept of hidden areas.
There may be other concepts, but I ain't a specialist in this area;
IAMOK::HIDDEN_AREAS covers more of the problem.
(sigh) C'est la vie !
Pierre.
|
60.22 | | KONING::KONING | NI1D @FN42eq | Tue Feb 07 1989 16:37 | 18 |
| There are definitely some misunderstandings about DECnet going around here,
which isn't helping the signal to noise ratio.
It does NOT matter whether your areas are the same, different, hidden, or not.
If you're going from one endnode to another on the same Ethernet, then
traffic will go direct (after a few initial packets). If the host is a
router, then things aren't always that efficient, but then again if you
run routing on your hosts things are slower anyway.
As for DECnet being flaky, there may be some resource allocation problems,
bugs, or whatnot. Certainly things can get bad when some of the routers
in the area are inadequate (e.g., 750s or worse). There is nothing in the
architecture that makes DECnet any more or less reliable, as far as
links staying up is concerned, than LAT. Certainly there is no such
issue as "less error checking" or "looser tolerances".
paul
|
60.23 | | PSW::WINALSKI | Paul S. Winalski | Tue Feb 07 1989 16:54 | 27 |
| RE: .21
You are assigning the blame for lost client/server communication to the wrong
place. The DECnet logical link remains intact--the problem is that the X
server is single-threaded and times out client applications on its own,
independent of the state of the DECnet logical link. This is a bug in our
current X server implementation and is independent of the protocol used to
provide the client/server transport. Switching to SCS or LAT would not solve
the problem--the X server would still run out of buffers and you'd still
be disconnected. This is a problem that should be fixed where the problem
occurs--in the server.
As far as hidden areas go, we should not be making strategic product design
decisions (such as what protocols to use for X) on the basis of temporary
configuration problems on our own internal network.
RE: LAT
No question about it--LAT performs magnificently for what it was designed to do,
which is to package single-byte transmissions on multiple virtual circuits into
a single ethernet message between a terminal server and its client CPU. It
is better than CTERM at this. However, X is a message-passing protocol, and
I question whether LAT would work as well as DECnet or SCS.
--PSW
|
60.24 | | VISA::BIJAOUI | Tomorrow Never Knows | Wed Feb 08 1989 03:19 | 34 |
| Re: .22
| Well, believe it or not, our DECrouter 2000s (which are the most powerful
L2 routers at the moment, correct me if I am wrong) do see the packets we
are sending from one workstation to the boot node.
As well, how should I consider Appendix A, paragraph A.6, page
A-16, of the Networking Manual? Have they got it wrong?
Re:.23
From my user's point of view, what I see is a *lost* DECnet link.
Whether it's DECnet or a server or a client doesn't matter to me. The
link is lost, my work is lost.
I'm glad you've found the bug, I'm sure it will be fixed for a future
release of DECwindows. If, on top of that, it suppresses the occasional
hangs I have on my VAXstation, then perfecto.
>As far as hidden areas go, we should not be making strategic product design
>decisions (such as what protocols to use for X) on the basis of temporary
>configuration problems on our own internal network.
I definitely agree. But I didn't imagine that adding another transport
to the set of DECwindows transports could be a "strategic product
design" decision.
Nevertheless, I will ask my question again:
Has anybody ever measured the packet rate per second (for instance) over
DECnet that DECwindows generates from a local application to a remote
display? Any statistics produced? Any performance tests?
Pierre.
|
60.25 | Area 51 speaking | CASEE::LACROIX | No future | Wed Feb 08 1989 03:40 | 23 |
| Re previous:
| I'm in area 51 too... We have lots of workstations on a private
Ethernet segment, and we were running into this problem of DECnet links
being dropped on the floor (yes, it could be the X server timing out on
its own). Gurus in CASEE came up with a very successful hack a couple of
months ago: basically, whenever a workstation was rebooted, NETACP on
the boot member was paging like crazy, going through the entire net
database looking for info on the workstation. That, plus the MOM
process and a too-small working set for NETACP, was causing *ALL* X
connections between the boot member and other workstations to be
aborted. The fix is to use an area number small enough to cut down on
NETACP's paging rate: area 1. Our boot member now thinks our
workstations are in area 1, and thus finds info on what it should do
with our workstations turbo fast. No more paging, no more links dropped
on the floor, no more 10-second cluster transitions, etc...
Incidentally, folks we talk to in the States were not very receptive
to the problems we were having; I suspect this is related to the fact
that you have a smaller problem when all your satellites are in area 3.
Denis.
|
60.26 | | STAR::BRANDENBERG | Intelligence - just a good party trick? | Wed Feb 08 1989 09:54 | 20 |
| re: various
What PSW said about the location of the problem is absolutely correct.
The problem begins with a poorly designed protocol, is aggravated by
the VMS interface to DECnet, was only partially corrected by the
transport, and a last-chance keep-alive effort was made in the server.
There is work-in-progress to make future versions better. What can you
do now? Use tcp/ip. Yes, even for vms-to-vms connections.
LAT? It's being looked at, but what Paul suggested may be true. A
protocol may or may not save you. LAT works nicely when there are many
data sources mapped to many data sinks, but what will happen when there
is a *single* data sink (a server)?
As for network load statistics, a test suite has been created and
numbers have been collected. A report is being written (I haven't seen
it yet). It should be interesting.
monty
|
60.27 | | KONING::KONING | NI1D @FN42eq | Wed Feb 08 1989 11:19 | 8 |
| Re the problem of NETACP taking so much time on downline load requests:
that certainly is a problem. It has been known for years. There are
various obvious solutions that haven't been implemented. However, none
of that has ANYTHING to do with the issue of which transport is appropriate
for X.
paul
|
60.28 | Technical reasons for protocol problems? | WINERY::ROSE | | Wed Feb 08 1989 14:45 | 6 |
| Re .26: "The problem begins with a poorly designed protocol..."
I realize this is kind of complicated, but could you please elaborate?
(This is not an argument, but I am just very curious because when
reading over the X protocol I did not see anything particularly wrong.)
|
60.29 | You'd think they'd learn after a while | PRNSYS::LOMICKAJ | Jeff Lomicka | Thu Feb 09 1989 12:54 | 6 |
| It seems like the modern equivalent of assuming all computer terminals
will operate at 38.4KB continuously without the use of xon/xoff...
Figures, considering the source.
|
60.30 | | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Thu Feb 09 1989 15:30 | 5 |
| A couple of notes here were hidden pending a discussion among the moderators.
We got a complaint.
Burns
|
60.31 | Wait a minute... | CIM::KAIRYS | Michael Kairys | Thu Feb 09 1989 15:42 | 21 |
| I would like to complain in the reverse direction. I was fortunate to
have read note .29 just minutes ago, prior to its being set hidden. I
believe I can guess what prompted the impulse to hide it.
However, I think the note presented information and a point of view
that is important and needs to be aired. I think .29 should be used to
start a discussion about real-world requirements which may (and should)
lead to those requirements being addressed. My area of concern is
discrete manufacturing; perhaps not as "critical" in some senses as
nuclear engineering, but nonetheless an area which demands dependable
delivery of information and needs windowing technology.
Perhaps the note could be slightly edited, if someone insists, and
returned to view. Personally it didn't seem inflammatory to me, but I'm
from Ann Arbor...
BTW, I also think note .31 presents a point of view about the history
of X that is worth (re?)stating.
-- A Concerned Citizen
|
60.32 | | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Thu Feb 09 1989 16:57 | 7 |
| There was not an "impulse" to hide it. Someone (not from VMS development, I
might add) was concerned about aspects other than inflammation.
Please let it go at that for the moment. I did not say this was the final
word. That is what "hide" is for as opposed to "delete".
Burns, unfortunately a moderator
|
60.33 | Odd that LAT, not SCS, is the main topic when LAT was in another note | CVG::PETTENGILL | mulp | Thu Feb 09 1989 19:25 | 29 |
| re: .23
>No question about it--LAT performs magnificently for what it was designed to do,
>which is to package single-byte transmissions on multiple virtual circuits into
>a single ethernet message between a terminal server and its client CPU.
The above statement is about `1/3 true'.
Bruce Mann usually talks about his experience developing network applications
(based on DECnet) when talking about the goals he had for LAT. He wanted a
fast (i.e., low in network and CPU overhead), fast (just in case you missed it
before), simple (i.e., something that didn't take an army of programmers and
managers), simple (i.e., something that one person could do and that would be
implemented widely) LAN transport. Most of the work that Bruce was doing was
realtime data acquisition, but terminal character echoing is best if it is in
realtime, so terminal I/O is very applicable. LAT is NOT Local Area Terminal;
LAT is Local Area TRANSPORT.
LAT and SCS have a number of things in common (Bruce was involved in the
architecture of both): they both multiplex multiple sessions over a single
virtual circuit, and they both plug into the applications in the kernel rather
than in user mode. While these points make interfacing them to the system more
difficult, there is usually a payoff in terms of performance.
LAT was always intended to be a multipurpose tool for supporting specialized
LAN applications. X was intended to be a LAN application. Depending on how
users use X, LAT+X may be a real winner. If X replaces ASCII, as it does with
an X terminal, then the use will be right as far as I can tell.
|
60.34 | Regarding the Protocol | STAR::BRANDENBERG | Intelligence - just a good party trick? | Fri Feb 10 1989 12:36 | 93 |
| re .28: Yes, you did, it's practically on page one but it's so huge,
no one seems to notice. Consider typical client/server operation: the
client sends asynchronous requests to the server while the server sends
asynchronous events to the client (say, resulting from mouse motion or
window reconfiguration). Only occasionally do the server and client come
together and synchronize their communications with a request/reply
pair.
What does this mean? It means that the only thing that keeps a
client/server connection running is the buffering capability of the
underlying transport implementation. A server in the throes of
generating motion events or window reconfigure events will run through
code that commits the server to sending events to at least one and
sometimes many client connections. When this happens, the buffering
capacity had better become available soon or the server will wait
until it does. The way the user's data is buffered on Berkeley-style
networking implementations, it often is available. But, say with a
record-oriented interface and quota scheme as with the VMS interface to
DECnet, it would almost never be available without additional work by
the application. (This is one of the intended functions of the common
transport image on VMS.)
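To make the buffering argument concrete, here is a small, self-contained demonstration of the underlying OS-level effect; it is not X code, and the buffer sizes and byte counts are arbitrary. Two processes connected by a socketpair each write without ever reading, and once the kernel buffers in both directions fill, both writes block and neither side ever finishes: the same shape of deadlock, with requests going one way and events the other.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>

    /* Both processes pump data at each other and neither ever reads.
       Once the kernel buffers in both directions fill, both write()
       calls block forever.  (This program intentionally never exits.) */
    int main(void)
    {
        int  sv[2];
        char chunk[4096];
        long total = 10L * 1024 * 1024;     /* far more than any socket buffer */

        memset(chunk, 'x', sizeof chunk);
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
            return 1;

        int side = (fork() == 0) ? 1 : 0;   /* child writes on sv[1], parent on sv[0] */
        for (long sent = 0; sent < total; sent += sizeof chunk)
            write(sv[side], chunk, sizeof chunk);   /* eventually blocks for good */

        printf("side %d finished (you will not see this)\n", side);
        return 0;
    }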
"Well, then it's a VMS problem, isn't it?" No. I'm the first to admit
that the VMS interfaces are often inconvenient for getting work done
but in this case, they merely exaggerated a problem with the protocol;
they did not create it. About two years ago, after finishing one of
the first ports of the server to VMS, we experienced frequent deadlocks
due to this problem (I should say we experienced infrequent successful
operation). I poked around, looked at the system, looked at the design
and said, "Look, this protocol is a deadlocking protocol." I received
very little indication that anyone understood the problem or that
anyone was interested. At this point, in my opinion, we should have
worked on the server semantics, or changed the protocol, or...
something but it didn't happen.
"Well, um, in R3 they fixed xlib to keep reading from the server if it
can't write requests." Yes. On Unix. But is that enough? Must the
operating system provide the means of recovery from a bad protocol?
Should a "reliable, production-quality, bullet-proof" server rely on
the good behaviour of its clients to ensure that it continues to
execute? Should it rely on the stability and predictability of a
network populated with LAVC's, NFS-served systems, diskless systems,
gateways, bridges, etc.? Should it rely on some unknown operating
system scheduling its clients so that it can continue operation? These
are the sorts of questions one must ask when designing a reliable,
distributed system. Answers are even better but I don't have any
that are clear and absolute. How about some scenarios? Here are some
possibilities which I can imagine (though they may not exist in fact).
And yes, they're pathological but they are intended as illustrations
to encourage discussion of the technology.
1) A standalone workstation whose user has a few xterms, a
wmohc (window manager of his choice), a clock, etc. He runs an X
application which creates windows, does some work, and interrupts it
leaving it around but not running. He goes on to do other things like
pop windows and drag his mouse around. All of a sudden, his
workstation hangs while the server tries to send some events to a
client that isn't running. How do you recover?
2) A workstation on a network has a client from a diskless workstation.
The link gets a bit behind while the client tries to write some requests
so it, being an R3 system, dutifully tries to read from the server. But
the code that reads takes a page fault and the NFS server has just crashed.
Three seconds later, the X server wants to tell this client about the
180 motion events that have occurred, and so it hangs. All because of an
NFS server *two hops away*.
"But, I've been programming on X workstations for years and it's
usually worked for me!" Well, so what? Is this proof by example?
Let's be honest with ourselves: the primary use of X systems up to
this point has been as programmers' workstations, to develop
programmers' tools, all to help programmers. Only now is it moving out
into non-programming and non-engineering tasks. I hope I'm not
bursting anyone's bubble with this proposition but, in my opinion, the
standards of reliability and quality to which programmers in the world
at large hold themselves *do not* compare favorably with those in most
other engineering and non-engineering activities. By analogy,
programming is to, say, civil engineering what astrology is to
astronomy or what numerology is to mathematics. Consider: a power
company might investigate using a workstation to display the operating
status of a fission reactor. Or medical equipment companies who'll
make instruments to monitor patients in surgery. Or manufacturers
who desire to control time- and position-critical processes in a
steel mill. When one builds a skyscraper, it is anchored in bedrock,
not in mud. I believe that this good-enough-for-programmers-so-it's-
good-enough-for-everybody attitude is *unacceptable* when the products
of these programmers are actually used by the rest of the world.
In taking this opportunity for a little bombastic opinion, I hope I was
able to adequately describe the protocol deficiency as I understand it.
monty
|
60.35 | re: .30 | STAR::BRANDENBERG | Intelligence - just a good party trick? | Fri Feb 10 1989 12:37 | 14 |
| (I've been stewing for two years but I'm feeling better now.)
Yes, I'm not too happy with the design but I can be fair. The "Boys
From Cambridge" didn't set out to solve the world's display problems
so many years ago (at least by my understanding). They created a
system that was built for programmers and students and it may be
adequate for that purpose. I certainly like to use the tools and the
environment for my work (programming). But first by accident and
then by *executive decree*, it was decided to make a commercial system
out of this that would solve everybody's needs. I am personally
uncomfortable with the way in which these decisions were made.
monty
|
60.36 | Well, look at where the market is putting its money | POOL::HALLYB | The smart money was on Goliath | Fri Feb 10 1989 16:29 | 12 |
| Nor is this the first example of the marketplace demanding an inferior
product. Your PC (Apple or IBM) crashes? Oh, well, reboot it and get
on with things.
These kinds of problems are seen to be like cars stalling then starting.
No big deal, it costs too much to engineer perfection.
Nuclear reactor operations? We'll buy two. They won't both fail just
prior to meltdown. Etc.
John
|
60.37 | Slight time warp (Old noters: remember those?) | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Fri Feb 10 1989 16:48 | 3 |
| For the record, .36 and .37 replace some notes which were deleted (29 and 31,
I think). That is why the context and order seem a bit funny.
|
60.38 | The inevitable follow-up questions | WINERY::ROSE | | Fri Feb 10 1989 19:13 | 15 |
| RE .36: Thank you, this is very interesting. Disclaimer: These are
questions -- not arguments. I'm trying to understand your note, not
rebut it.
Are you contending the following? It is impossible to write a server
that does not hang if a client hangs and if enough events occur that
are directed to that client.
You say this is even true over TCP/IP, just that the probability of
hanging is much lower under TCP/IP on Ultrix? Is that because TCP/IP on
Ultrix allows more stuff to be in the pipeline undelivered?
An even more general question: What is the simplest change to X that
would make it possible to write a hang-free server?
|
60.39 | The Ultrix server seems hang-proof | FLUME::dike | | Sun Feb 12 1989 11:38 | 10 |
| I checked the Ultrix sources, and it doesn't look like the server is capable of
hanging. The server sets up connections so that if a read or a write would
block, the call returns immediately with EWOULDBLOCK. If the call was a read,
the server services other clients until the rest of the data comes through. If
it was a write, the client is punted.
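For anyone who hasn't looked at that part of the sources, a minimal sketch of the setup being described follows; it is not the actual server code, and the punt routine is a stand-in, but fcntl(), write(), and EWOULDBLOCK behave this way on a BSD-style system.

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Stand-in for the server's real "drop this client" routine. */
    static void punt_client(int fd) { close(fd); }

    /* Mark a client connection non-blocking (FNDELAY is the older BSD
       name for what POSIX calls O_NONBLOCK). */
    static int make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Try to send pending output.  Because the connection is non-blocking,
       a full pipe shows up as EWOULDBLOCK instead of a hang; the policy
       described above is to punt such a client rather than wait for it. */
    static void flush_client(int fd, const char *buf, size_t len)
    {
        if (write(fd, buf, len) < 0 &&
            (errno == EWOULDBLOCK || errno == EAGAIN))
            punt_client(fd);
    }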
I don't intend to claim that anecdotal evidence amounts to proof, but I have
never heard of an X server on Ultrix hanging in a read or a write.
Jeff
|
60.40 | The problem is not in managing the line but the resources consumed by the server. | IO::MCCARTNEY | James T. McCartney III - DTN 381-2244 ZK02-2/N24 | Sun Feb 12 1989 18:18 | 23 |
| RE.: .41
Consider an application that enables mouse events, then promptly goes off to
"sleep" (ignores making a call to get the next event). The process may actually
be doing something useful (like an FFT or Finite Element model). Meanwhile, the
impatient user is idly dragging the mouse around, generating thousands of events
per minute. The server, attempting to preserve these events, is packaging them
up as quickly as it can and shipping them out to the client. Eventually, the client's
network buffer fills, the network transport layer screams "No more..." and the
server has to decide whether to buffer it locally, or to drop things on the floor.
Early servers attempted to do no buffering and simply aborted the link, causing
intrinsic reliability problems. I can't speak for the existing VMS and Ultrix
servers, having not seen the code, but I believe that this is one of the
problems to which Monty is referring.
In extremely severe cases, it is possible that the server will exhaust its
resources trying to buffer events locally, and thus hang. Until the dormant
program gets around to reading its event queue, nothing can be done on the
server.
James
|
60.41 | %DECW-F-IPI-Insufficient programmer intelligence failure at ... | IAGO::SCHOELLER | Who's on first? | Mon Feb 13 1989 10:16 | 9 |
| re: .42
That is why we have been frequently reminded not to write programs that
disappear for a long time without checking the event queue. A small amount
of intelligence on the part of the application developer prevents this
client from being punted.
Dick
|
60.42 | Look at what has come before | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 11:30 | 71 |
|
re .40: Is it possible to write a hang-free server? If it is not
acceptable to drop a connection at the first sign of a hang, then
I believe it is impossible to write a *reliable*, hang-free server.
In previous replies (to which I will respond shortly) note how recovery
takes place: if a server write blocks, drop the client. Most
low-level networking protocols implement some sort of quota system
(windows or debit/credit or ... ) in the protocol itself. The X
protocol implements it in the operating system interface (if it doesn't
fit, kill the connection). This is one thing that must change if we
are to have a reliable server. There are at least two ways that this
can happen: either by changing the protocol and server semantics to
include a debit/credit system for server-to-client communication or
by changing them to allow unreliable delivery of events.
I'll consider the latter first. In certain areas, the X server has
already made some movement in this direction. With the realization of
how large a load can be generated by mouse motion events, the designers
created a "motion history buffer" in the server. If we're generating
events too quickly, and the client allows, put motion events in this
buffer and report to the client, via events, that there is something
interesting in the motion history buffer. While this implementation is
along the lines of an infinite buffer approach, look at what they're
really doing:
1. Server attempts to send report and fails (or might fail).
2. Server stores state change (mouse motion).
3. Server reports to client availability of state change
(motionHints is non-zero or whatever).
4. Client synchronously requests report of state change
(getMotionHistoryBuffer).
Generalizing this and changing the implementation would give a server
that doesn't *insist* on sending every single event and a chance at a
reliable, hang-free server.
Or, how about a debit/credit system? Xlib could piggyback event credit
values on requests. An initial maximum could be inferred from the
networking quotas and that particular networking interface. This still
implies a server that isn't required to send events or one that is able
to encapsulate state changes to be sent later. Another way is to
change the communication model to something along the lines of an RPC.
Asynchronous client requests could still be asynchronous but server
state changes (i.e. events) would be acquired mostly synchronously.
There might still be an event credit to retain interactivity if it is
shown to be necessary but Xlib calls such as XNextEvent would become
request/reply pairs. These are just some ideas, nothing has been tried
but I think these are interesting avenues to pursue.
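A minimal sketch of what the server side of such an event-credit scheme might look like; nothing like this exists in the protocol today, and the structure and routine names below are invented purely to illustrate the idea.

    /* Hypothetical per-client flow-control state.  The client grants
       event credits (say, piggybacked on its requests); the server
       consumes one credit per event and never blocks on a send. */
    typedef struct {
        int  fd;              /* transport connection to the client      */
        long event_credits;   /* events the client has agreed to accept  */
    } ClientFlow;

    /* Called when a client request arrives carrying a credit grant
       (an invented field, for this sketch only). */
    void grant_credits(ClientFlow *c, long n)
    {
        c->event_credits += n;
    }

    /* Called when the server wants to deliver an event.  Returns 1 if
       the event may be sent now, 0 if it must be held (or summarized,
       as with the motion history buffer) until more credit arrives. */
    int may_send_event(ClientFlow *c)
    {
        if (c->event_credits > 0) {
            c->event_credits--;
            return 1;
        }
        return 0;
    }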
As for the tcp/ip vs decnet and ultrix vs vms issues... it's more a
matter of programming interface than either base protocol or host
operating system. VMS tcp/ip (connection), Ultrix tcp/ip, and Ultrix
DECnet all work pretty much the same: they're byte-streamed,
socket-derived interfaces that buffer user data in an almost pure
byte-limited fashion (I believe 4K per direction and per side is the
default in all the above-mentioned implementations). VMS DECnet, on
the other hand, has a record-like quota system based on segments, with a
$QIO more-or-less generating at least one segment. The byte-stream
model allows a client and server to run skewed, which is to some degree
a requirement in *any* distributed system. (Imagine trying to pipe
some shell commands together if the byte-quota on a pipe was, say, one
byte.) Unfortunately, this model is also tolerant of protocol design
failures. Architectures which are intrinsically deadlocking appear to
work simply because the deadlock condition is unlikely and the allowed
response to a deadlock, if the interface allows it to be sensed, is to
give up.
Just some thoughts...
monty
|
60.43 | Oh, yes, an experiment. | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 11:38 | 16 |
|
re .40: I've had an idea for an experiment for some time but I can't
get the resources to perform it. The idea was to get two Ultrix
machines on their own Ethernet and set up an X test environment that
would allow me to create an arbitrary CPU load on either a server or
client machine. I would then vary two variables, the load on a
system (either server or client) and the mbuf quota for links, and
observe and measure the reliability of various interactive
applications. My belief is that connections will become markedly
unreliable as quota is dropped. My contention is that there is no
threshold at which a connection becomes reliable; that there is only a
curve giving probability of failure which is never zero and which is a
function of so many variables that we can never say "you're safe."
monty
|
60.44 | Hang-proof isn't the same as reliable | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 11:48 | 8 |
|
re .41: You are absolutely correct in the Ultrix case. But first,
only Unix has the nice FNDELAY option, and must this be used to
implement the protocol and server semantics? And second, it doesn't
hang, but is it reliable? Can't Joe Customer have both?
m
|
60.45 | | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 12:16 | 36 |
|
Re .43: This is an extremely poor attitude to take. I've already
complained that X must rely on networking implementations to survive
(an inappropriate mixing of levels) and now you're suggesting that the
remaining slop be taken care of by the application programmer. By the
goodness in our hearts, we'll make this work?
Truthfully, what justification is there for a call to
ProcessInputEvents() in the outer loop of a 2D FFT? Or an image
convolution? Or a large, atomic database transaction? Or any of the
other things that makes money for our customers? I could argue from
aesthetics (it's ugly), or structured programming paradigm (it's mixing
levels), or from performance (it ruined the register optimizations), or
from programmer convenience (they have to do everything), or from a
quality assurance standpoint (more and more testing just to see if they
can keep X alive). And I claim it still isn't enough.
The server can't control the application environment. The application
may be on another machine, on another operating system, in another
country. Well, neither can an application programmer completely
control the application environment. The programmer can't control when
his process will be scheduled, can't control taking a page fault served
by a crashed nfs server, can't control slow or overloaded or unreliable
networks, etc. etc. etc. The application programmer tries to get his
algorithms correct and relies on the correctness of the system software
to get the rest done. Is the programmer's trust well placed?
We are trying to create a reliable, distributed, interactive, graphical
system. (Those four adjectives are *very* important.) I believe this
is the single hardest networking problem anyone has yet seen. It's
more difficult than the base networking support (tcp/ip, udp/ip,
decnet, whatever), rpc's, remote terminals, distributed filesystems,
naming services, etc. And I think it is not yet solved.
monty
|
60.46 | The future's not bright so take off your shades | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 12:25 | 14 |
|
Those who can begin to see the stochastic nature of these systems might
think about the future. The range of networking speeds is increasing.
Some people insist on serial line interfaces to X while others are
preparing for FDDI and HSC. The range of CPU speeds is increasing.
Two years ago, everything was pretty much one- to three-mips. Servers,
clients, pc's, routers, hp handheld calculators, etc. Now we'll have
Cray's, Connection Machines, Multiflow's, DAP's, MIPS boxes, SMP vaxes,
on down to 68000-based X terminals. This reliability curve I mentioned
(really a reliability manifold) is dependent upon all these variables
and others. What is it going to look like in the future?
monty
|
60.47 | | KONING::KONING | NI1D @FN42eq | Mon Feb 13 1989 12:33 | 5 |
| Note that many of these would be non-problems if the operating systems we
use had decent multithreading facilities built-in.
paul
|
60.48 | You've just moved the deadlock | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 13 1989 12:52 | 17 |
|
Re .49: Do you mean for use by the server, one thread per connection?
If so, I think not (though others in VMS think it would be wonderful).
The problem is that clients intentionally and necessarily interact with
one another. They share real estate, keyboards, colormaps, etc. and
when one client changes these, the others may need a report.
XSendEvent, properties, and selection encourage communication between
clients. And, because all these resources are shared, the database
which maintains them is also shared. And then there are clients which
require atomicity across multiple X operations (such as the window
manager), hence locking out other threads. All this communication
between clients implies locking; if a client needs a lock held by
another client who is blocked by transport, that client will also
block. Conclusion: server deadlocks can still occur.
monty
|
60.49 | Insufficient Architecture | EVETPU::TANNENBAUM | TPU Developer | Mon Feb 13 1989 13:21 | 16 |
| Re: .43
Yup, DECwindows requires that an application frequently check the input
queue. TPU had to jump through hoops to implement this. And it's
still not right. I recently found that TPU's not checking the input
queue while a subprocess is running (so don't do anything large in a
subprocess and then wiggle your mouse on a DECwindows EVE window).
How many other places have we missed, simply because no one considered
yet another obscure area of the code?
It would be a *LOT* easier if this was handled once, correctly, instead
of trying to duplicate it in every application.
- Barry
|
60.50 | ? | WJG::GUINEAU | | Mon Feb 13 1989 15:58 | 13 |
|
Funny, my first use of X (DECwindows) was for an application that would
go off for more than 1 hour as a result of one mouse click. While it
was gone, the interface was dead! After a few contortions and mucho help
from this notes file, I got it all working by spreading ProcessXQueue()
calls all around the "work routine".
I never suspected the far-reaching implications this really had, but
figured there must be a better way (like having a separate thread do the X
queue processing asynchronously to the rest of the application).
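For context, a minimal sketch of the kind of contortion being described; XPending() and XNextEvent() are standard Xlib calls, while the work routine, the dispatcher, and ProcessXQueue itself are placeholders standing in for whatever the real application did.

    #include <X11/Xlib.h>

    static void dispatch_event(XEvent *ev) { (void)ev; /* app-specific */ }
    static void crunch(double *x) { *x *= 1.000001; }  /* stand-in for real work */

    /* Drain anything the server has queued for us so it never backs up
       while the long computation runs. */
    static void ProcessXQueue(Display *dpy)
    {
        XEvent ev;
        while (XPending(dpy) > 0) {
            XNextEvent(dpy, &ev);
            dispatch_event(&ev);
        }
    }

    /* The "work routine": a long-running loop with event checks
       sprinkled into it, as described above. */
    static void long_computation(Display *dpy, double *data, int n)
    {
        for (int i = 0; i < n; i++) {
            crunch(&data[i]);
            if (i % 1000 == 0)          /* check the queue now and then */
                ProcessXQueue(dpy);
        }
    }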
John
|
60.51 | | KONING::KONING | NI1D @FN42eq | Mon Feb 13 1989 17:53 | 14 |
| Right. I was referring to the application, not the server.
On the server side, there has to be a better way too. For example, events
could be discarded when there are too many pending transmission to a particular
client. Such flow control would of course have to be on a per-client basis.
Then when the flow starts again, the client would receive a "you just lost
some events because you were too slow" event along with the subset of real
events that was kept. (You may recognize this approach -- it's the one used
in DNA for event logging.) It may or may not be appropriate for the server
to provide some feedback to the user (bell, or some such?) in addition to
the events-lost event that goes to the client.
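A rough sketch of how such per-client discarding might look inside a server; the queue structure, the bound, and the lost flag are all invented here for illustration and are not part of the current protocol.

    #include <string.h>

    #define MAX_PENDING 256                     /* arbitrary per-client bound */

    /* Invented per-client output queue.  When it fills, further events
       are dropped and a flag records that an "events lost" notification
       should be delivered once the flow starts again. */
    typedef struct {
        unsigned char events[MAX_PENDING][32];  /* X events are 32 bytes */
        int head, tail, count;
        int lost;                               /* nonzero: events were dropped */
    } EventQueue;

    /* Queue an event for a client, discarding it (and remembering the
       loss) rather than blocking when the client is too slow. */
    void queue_event(EventQueue *q, const unsigned char ev[32])
    {
        if (q->count == MAX_PENDING) {
            q->lost = 1;                        /* tell the client later */
            return;
        }
        memcpy(q->events[q->tail], ev, 32);
        q->tail = (q->tail + 1) % MAX_PENDING;
        q->count++;
    }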
paul
|
60.52 | Thought about that, too. | STAR::BRANDENBERG | Intelligence - just a good party trick? | Tue Feb 14 1989 10:13 | 8 |
|
We argued the possibility of an "events lost" event but the problem is
with recovering the state change information from the server. These
changes are quite complicated and must be retained in some form for a
client to keep its environment in order.
m
|
60.53 | So, how about some feedback? | STAR::BRANDENBERG | Intelligence - just a good party trick? | Tue Feb 14 1989 10:36 | 2 |
|
|
60.54 | DECW-F-NONMODULAR Program author not aware of DECwindows in 1967. | IO::MCCARTNEY | James T. McCartney III - DTN 381-2244 ZK02-2/N24 | Tue Feb 14 1989 16:13 | 79 |
|
RE: .43
I don't suppose that you are suggesting that we call the authors of packages
like IMSL, SPSS, STRUDL, CHEATAH etc. and inform them that their carefully
optimized matrix operations take too long. When we tell them that they should
break up their routines for DECwindows applications (because we're incapable of
building a robust server that avoids such complications), their reaction will be
the same as mine - laugh and go find a hardware vendor that builds computers, not
toys. If they had wanted a toy they would have called MATTEL.
Seriously, if we can't solve the problem of flow control on the X event queues
and come up with a realistic interpretation of what to do when the transport
becomes clogged, we will have some very unhappy customers. Some of their sources
have been in existence since the mid-60's, and the programmers that wrote the
codes may have actually retired! Cracking open all these dusty decks simply
because DECwindows comes along is not a good reason. (This assumes that
callously disregarding the modularity concerns is a viable option. Since
we've heard over and over from these vendors: "Give us faster hardware, better
and more interactive interfaces, but don't make us rewrite our codes.", we know
it's not!)
RE: .55
Feedback: Complete agreement with ideas expressed so far. The only thing that
still needs some discussion is what to do about the "lost events" event.
I see the problem with the need to keep the application and the server in sync,
but the hang (or hang-up) solution is definitely not adequate. If an application
were to get a "lost events" event, would it not be safe for the application to
assume that it should initiate its own recovery mechanism? For instance, unmap
all windows and remap to restore "correct" appearances?
How does discarding input events cause problems? Applications already know how
to tolerate typeahead buffer overrun. Simply dropping mouse or keyboard events
that cannot be buffered should be sufficient. This behaviour is (I believe)
consistent with existing experience and provides a system that will degrade
with dignity.
Some special feedback mechanisms need to be provided by the server to ensure
that this overrun condition is quickly detected by the human operator. I believe
there are only three different mechanisms that must be provided: keyboard event
loss, locator motion event loss, and locator button event loss. For keyboard
event loss, simply ringing the bell à la the terminal driver is sufficient. This
same mechanism may also be useful for mouse button event loss. The difficulty
is to find reasonable feedback for the locator motion event loss.
For locator motion, we want to preserve the ability to move to another
application and continue work there; after all, concurrency is one of the good
things that workstations provide. Also, the application we might be moving to is
our "hot backup" of the session that has encountered overrun problems. Given that
you accept these design parameters, we obviously cannot just ignore locator
motion input. We must also track the cursor location on the screen accurately,
so we can't just refuse to update the cursor. This leaves only two variables,
shape and color. Perhaps we can define a cursor shape or color which can be
interpreted as "locator events being discarded". Perhaps the locator cursor could
alternate between two different shapes in this (abnormal) case. I don't know
what the best answer is for this problem - comments?
As to what an application should do for lost events, we can easily answer these
questions. If the keyboard events are discarded, it will be as if the user never
struck the key. The application will be unaware of the lost events. For locator
button events, especially timing-sensitive double and triple clicks, the lost
events will not be in the data stream but the "lost events" event will be. The
application can take action based on this new event type - usually to ignore
any partially completed operation. For locator motion, applications already have
to be able to process non-linear motion since the tablet reports position and
not relative information.
I admit that accurate locator button tracking is difficult, especially since
there are timing windows which can cause a lot of pain. For instance, consider
the problem of what happens when you are in a marginal network condition, have
down-clicked to make a pull-down selection, started moving the mouse, buffer
overrun occurs, you continue to move the mouse (discarding events), buffer
overrun clears, and you release the button. Unless the application is careful,
this situation can lead to disastrous results.
Comments?
|
60.55 | | KONING::KONING | NI1D @FN42eq | Tue Feb 14 1989 17:34 | 8 |
| Clearly the crudest possible response for an application that receives an
"events lost" would be to give up. That would make it no more crude than
the present approach. Of course applications can do better; how much better
depends on the application, the skill of the designer, etc. I'd certainly
go along with the comments in the preceding response.
paul
|
60.56 | | PSW::WINALSKI | Paul S. Winalski | Tue Feb 14 1989 17:56 | 9 |
| I like the idea of an "events lost" event. The author of an application knows
which events the application has elected to receive. The application is in
the best position to determine whether the loss of events is recoverable or
not--right now, it is the server that decides (and it always decides that an
event loss is unrecoverable). My educated guess is that the vast majority of
"events lost" events would indeed be recoverable by the application.
--PSW
|
60.57 | | MYVAX::ANDERSON | Dave A. | Tue Feb 14 1989 18:16 | 7 |
| To make the decision easier for the application, report what type of
events were lost (keyboard input, mouse motion, mouse button, etc?).
This requires maintaining only a negligible amount of additional state
information.
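A sketch of what such an event might carry; this is not a real X event type, and the names and mask bits below are purely illustrative.

    /* Hypothetical "events lost" event, patterned on the fixed 32-byte
       X event format.  The mask tells the client which classes of
       events were dropped so it can decide how much recovery to do. */
    #define LOST_KEYBOARD        (1u << 0)
    #define LOST_POINTER_MOTION  (1u << 1)
    #define LOST_BUTTON          (1u << 2)
    #define LOST_EXPOSE          (1u << 3)

    typedef struct {
        unsigned char  type;          /* invented event code            */
        unsigned char  pad;
        unsigned short sequence;      /* last sequence number processed */
        unsigned long  lost_mask;     /* OR of the LOST_* bits          */
        unsigned long  lost_count;    /* approximate number dropped     */
    } XEventsLostEvent;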
Dave
|
60.58 | More ideas | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Tue Feb 14 1989 18:19 | 19 |
| For "events lost" we should probably allow the client to say something
about what events he can tolerate losing (a hint, presumably), and
also, the "events lost" message should probably tell something about
the nature of the events. For example, if the client knew that the
messages lost included mouse motion and expose, it could completely
repaint itself and Query the mouse position.
BTW, there is a conference discussing X protocol change proposals.
It's not very active, but maybe it should be.
BTW2, I would like to hear some more discussion of why/why not this is
a problem on TCP. If I were to lobby for something like this, I would
need to make good arguments to Unix people. (Don't take that to mean
that good-ole-Burns will get this little protocol thing solved for the
next version. This would take more than a little deep thought,
argument, and lobbying)
Burns
|
60.59 | It applies to EVERY transport | KONING::KONING | NI1D @FN42eq | Wed Feb 15 1989 12:22 | 30 |
| The problem is clearly independent of transport. It applies equally well
to TCP/IP, to the local transport, and so on.
After all, the problem isn't really the transport at all. The problem is
application level flow control: the possibility that the server is generating
data (events) faster than the client is accepting them. As things stand,
the application layer flow control is mapped into transport layer flow control,
since the client stops issuing Transport receive requests, which eventually
blocks Transport send requests at the server. So the server application
ends up with data that it can't send.
It's usually well understood that distributed applications require flow
control to bound the size of the queues. There are a couple of possibilities
in the general case:
1. Design the receiver such that it is guaranteed to run at least
as fast as the sender.
2. Have the sender stop generating new data when the queue is too large.
3. Discard data when the queue is too large.
X does none of these; it uses the "off with his head" approach. Given the
properties of X, #2 is not possible (event generation is controlled by the
user at the keyboard/mouse, not by the server alone). #1 is also not
practical, so that leaves #3.
Note that I didn't mention DECnet anywhere in this discussion; it's all
transport-independent. (Or you might say that the whole discussion was
in the application layer, not the transport layer.)
paul
|
60.60 | I think "KISS" is the necessary magic | BORA::MARTI | Beat Marti - ISV Support - MR4-1/H19 - 297-3074 | Wed Feb 15 1989 13:50 | 18 |
| The problem is not one of transport. It should also not be left up to the
application (or application programmer sprinkling silly event queue flushes
all over the code) to solve the problem. It seems to me, that the only place
where we can think about some reasonable solutions is right where the problem
occurs - at the server.
I don't see anything wrong in stealing the idea from the terminal handlers
which simply ring a bell when the buffer overflows. How about if the server
would simply freeze the pointer, or better yet - change the shape of the pointer
similar to the watch cursor - within the windows of the application which
is to receive the events which are going to be dropped. In addition, make
sure that any mouse clicks, keyboard inputs or such actions directed to that
application result in some easily identifiable response, maybe something
like ringing the bell.
I don't know how complicated it would be for the server to implement such
functions - but the concept definitely seems simple enough....
|
60.61 | LostEvent | STAR::BRANDENBERG | Intelligence - just a good party trick? | Wed Feb 15 1989 14:41 | 119 |
|
re .61: Beautifully stated. The mapping of application flow control
onto transport flow control is precisely the problem. Transport
implementations have different flow control and so X appears to operate
differently on different transports. However, the fault lies with the
protocol design and server semantics.
(An aside: it is truly a pleasure to be talking to people other than
myself for once. Thanks to all for your participation.)
I'll conclude from the replies that true reliability and not probable
reliability should be a goal for a DECwindows server. I concur with
Paul Koning's conclusion that this implies that the server may drop
data it attempts to send to a client. There are two types of data
which a server may send to a client: replies and events (errors are
encoded as events). What is the "best" way to handle each type?
I'll consider replies first. A reply is generated in response to some
client request and so there is some indication that the client will try
to cooperate with the server. But what if sending some reply should
block? I see three possible responses:
1. Drop the connection at the first sign of congestion. This
certainly guarantees that the server never hangs but it isn't
really reliability. The protocol will in theory allow replies
to be as large as 16GB. How well the client is able to sink
the reply data from the server will depend upon what is being
sent, how well the network is operating, whether page faults
are being serviced, the relative speeds of the client and server
machines, how the client is being scheduled, and a host of
other factors. The client may be trying to read the reply but
the program environment, which it can't control, may not allow
the client to keep up with the server.
2. Guaranteed transmission of replies. This will ensure that any
"best effort" client will receive a reply but now the server
will hang until a reply can be buffered by transport. I've
already given examples where this time may be unacceptably
large.
3. Best effort attempt. How long can we allow a server to hang
in an attempt to transmit solicited data to a client? Decide
this and use it as a timeout on reply transmission. This
doesn't give 100% reliability *but* we now can quantify the
amount of time a client can take to read a reply if it wants
to retain its connection. I prefer this choice.
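As a rough illustration of option 3 - not how any real server is written - a bounded-time
reply send might look like this with a BSD-style socket; the timeout value and the
disconnect-on-failure policy are assumptions:

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/select.h>
    #include <unistd.h>

    #define REPLY_TIMEOUT_SEC 10

    /* Returns the number of bytes written, or -1 if the client failed to
     * drain the connection in time (the caller would then disconnect it). */
    ssize_t send_reply_bounded(int fd, const char *buf, size_t len)
    {
        size_t sent = 0;
        while (sent < len) {
            fd_set wfds;
            struct timeval tv = { REPLY_TIMEOUT_SEC, 0 };

            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
                return -1;                    /* timed out, or select error */

            ssize_t n = write(fd, buf + sent, len - sent);
            if (n <= 0)
                return -1;
            sent += (size_t)n;
        }
        return (ssize_t)sent;
    }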
Onward to events. What is done here will have far-reaching
implications on the whole decwindows engineering effort. There are
some applications whose compute tasks are so large relative to the user
interface component that they won't mind rebuilding the interface
should something be lost. Others will be mostly user interface and
will want to do as little as possible to recover from a gap in event
transmission while still being reliable. The most interesting of this
latter type, I believe, is the toolkit itself.
With that in mind and a predilection towards the "lost events" event
and some experience with the intransigency of those who control the
protocol, I'll consider that possibility. Review the protocol manual
and read the x.h and xlib.h files to see what kinds of events are being
generated and what they cause. It was suggested that mouse motion
events are the primary candidates for encountering congestion, but they
are not the only ones. Mouse motion can also generate Enter/LeaveNotify events
for windows up and down the hierarchy. The offending mouse motions
could have been part of a button-down sequence, so that not only are
mouse motions lost but also the all-important button-up event. Add
grab/ungrab events to this mess, also. Then, there are the
ConfigureNotify and Expose-type events. These are usually caused by
another client (the window manager) and failure to respond to these
will certainly cause ugly holes. Also, some of these events are
"counted" events. I.e., they contain fields which count down the
number of events which an application may *reliably* expect but which
may be lost in the new system. Then, there is the Brave New World of
events defined by unimagined extensions. The amount of state that
needs to be kept and transmitted to the client isn't that small.
Basically, an application must be able to enquire as to the exact state
of the server or at least be able to return it to a known
configuration.
Without thinking too hard, I'll take a stab at such an event and what
kind of support it will require. It is probably interesting to know
what type of events were lost. There are 128 possible events (256 if
one wishes to distinguish "natural" events and those sent with
XSendEvent). 128 bits requires four longwords. The event header is
another longword. Since this event encompasses a range of activities,
the application may want to know how long it was out. If so, reserve
an additional two longwords for either timestamps or full sequence
numbers to indicate when events began to be lost and when event
transmission (always beginning with this event) resumed. We now
have seven longwords of an eight-longword event packet used. The
remaining longword could be used for modifier and mouse button state at
the time the events-lost event is transmitted. (NB: This event is
connection-wide: it does not associate with any one resource.)
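For concreteness, here is one hypothetical C layout of such an eight-longword (32-byte)
packet, matching the fixed X event size. All field names are invented; nothing here is
part of the real protocol:

    #include <stdint.h>

    typedef struct {
        uint8_t  type;              /* event code for LostEvents                */
        uint8_t  detail;            /* unused                                   */
        uint16_t sequence;          /* completes the standard header longword   */
        uint32_t lost_mask[4];      /* 128-bit mask of event types lost         */
        uint32_t first_lost;        /* timestamp/sequence when loss began       */
        uint32_t resumed;           /* timestamp/sequence when delivery resumed */
        uint32_t key_button_state;  /* modifier/button state at delivery        */
    } XLostEventsPacket;            /* sizeof(XLostEventsPacket) == 32          */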
What kind of response might a client want to take on receipt of such
an event? It could:
1. Give up. Currently, the server does this for it but now a
client will have to do it itself.
2. Update everything. This means repaints, dropping active and
passive grabs, fixing keyboards if they've been changed, etc.
Consider all the resources that may be involved and this may be
a very time-consuming task (as in the case of the toolkit).
A review of the Xlib interface is needed to ensure that we
*can* restore to a known state.
3. Intelligent/Selective Update. The application needs to perform
query operations to see what has changed. We may need new
protocol requests to query the window layout (as visible not as
defined by the application), GC's, colormaps, and other
resources. Extensions must provide equivalent functionality.
Additional work is needed in the server up through every
application.
Comments?
Monty
|
60.62 | | KONING::KONING | NI1D @FN42eq | Wed Feb 15 1989 14:51 | 41 |
| .63 says part of what I was in the process of replying to .62...
Re .62: It's not that simple: there may be multiple clients using the same
server. (In fact, there just about always are multiple clients.) The property
you MUST have is that one client's lack of progress does not block other
clients. So you can't simply stop accepting keystrokes, or mouse motions,
or whatnot, since some of those inputs may be going to clients that are
operating correctly. And of course some events are generated by the
actions of other clients: if client A deletes a window, client B may
receive an exposure event. Clearly it would not be valid to prevent A from
deleting that window.
If events have to be discarded, and the events are input (keystroke, mouse)
events, then a bell or some similar feedback may be a good idea. But
with or without that, I believe an "events lost" indication is essential.
If an application wants to take a head-in-the-sand attitude it can simply
ignore such events, though this would tend to result in low quality
applications. A full repaint (treating events lost as a full exposure event)
is probably the minimum that makes sense. As .63 points out, restoring
ALL the state may take a lot of work. There's probably a subset of the
state that could be restored efficiently; something on the order of what
is restored on deiconize. (Then again, I may simply be showing off my
ignorance of the complexities of X here.)
As for the suggestion to provide some more detail on the events-lost
notification (e.g., classes of events that were lost): that might be useful,
though I suspect most applications wouldn't make use of that. The fact
that anything at all was lost would be grounds for recovery actions; since
the application isn't supposed to be falling behind and losing things as a
normal operating mode, you wouldn't want to make those recovery actions
all that sophisticated. There's a rule about "this shouldn't happen" type
of error recovery code -- it says that such code in fact doesn't work in
the field, since it's not tested during field test, certainly not in all
its permutations. This argues for keeping the lost-event handling code
simple, since in most applications and most configurations it should be
rare. (Another way to justify that it should be rare is that this event,
when it occurs, disrupts the user interface. So a human factors argument
says that it must not occur often.)
paul
|
60.63 | Getting back to the problem, if not the subject | POOL::HALLYB | The smart money was on Goliath | Wed Feb 15 1989 16:54 | 20 |
| Designing protocols can be fun, and you guys are doing such a great job
that I don't need to make any contributions. But I am worried about what
appears to be the harder problem -- those long-running applications that
don't want to change their code. It seems to me that if you have a
developer who's going to make use of "lost events", repaint the screen,
clean up etc., then you probably have a developer who's going to write
good enough code so that the problem doesn't arise in the first place.
But what do we do about the application that goes into a black hole and
ignores events for a long time? Should we provide developers with real
fast test-1-bit type instructions (if set, call the event queue processor)?
Or should we provide some way (like ASTs, but not ASTs) to sort of force
a reluctant application to process events?
It should be OUR desire to make DECwindows such an attractive system that
ISVs will want to use it. Forcing X calls into application loops isn't
the way to advance that cause.
John
|
60.64 | | PSW::WINALSKI | Paul S. Winalski | Wed Feb 15 1989 17:24 | 39 |
| RE: .65
I don't see where the case you're referring to is a problem. Suppose we have
an application that does some DECwindows setup, then calls a subroutine that
does several days' worth of number crunching, ignoring the X event queue the
whole while. Upon leaving that subroutine, it updates the DECwindows screen.
What happens now is that if the event queue fills up, the server drops the
connection and the application bombs once it leaves the subroutine.
If the "events lost" event is added, the application can find out that this
has happened, if it wishes to, and can take corrective action. If the
programmer ignores the "events lost" event, then it's possible that there might
be misbehavior. So what? At least with "events lost" events, this sort of
application can recover from lost events. With the current protocol design, it
cannot. Note also that the check for events doesn't have to be in the number-
crunching subroutine this way--we haven't forced the programmer to turn his
application inside out.
Regarding getting this change accepted by the X Consortium--the "events lost"
event seems to be in keeping with the general X philosophy of pushing work back
on the application. Just as an application must decide if exposure events are
significant and if so, process them, "events lost" events put the decision on
whether event buffer overflow is significant in the hands of the application,
not the server. If the application chooses not to handle such events, it can
either ignore them or abort. One would expect the DECwindows Toolkit to recover
from such events, of course.
A properly-designed server that is supposed to handle more than one client
simultaneously should never let itself get into the situation where a flow
control problem with one client blocks the entire server. This can be done
without the server imposing any kind of timeouts on client connections. Link
breakage detection and timing out of connections should be the job of the
underlying virtual circuit transport on which the server/client communication
is based--it should not be done by the server itself.
--PSW
|
60.65 | I think I can answer the question about why TCP is less affected | RIGGER::PETTENGILL | mulp | Wed Feb 15 1989 20:37 | 34 |
| TCP (it doesn't need to be IP, but usually is) is byte stream oriented. This
means that the application needs to provide its own record framing (which isn't
usually much of an issue) and `interrupt messages' are sort of a kludge (if
you don't have records, how do you know where to insert data that is supposed
to skip to the front of all other data without hopelessly confusing the
application). However, X fits TCP well (for hopefully obvious reasons) since
it has its own `record structure' and it doesn't use interrupt messages.
So, how does this help ?
Being a byte stream protocol, TCP is geared to handling a byte stream. Its
flow control unit is bytes, not records, and it gets to decide for the most
part when to transmit data based on either a timer or some fraction of its
buffer being filled. When it transmits a datagram (usually IP but not required)
the datagram includes the starting byte in the current window and the number
of bytes in this segment. On the receiving end, the message must ALWAYS be
processed, even if (some of) the data has been received (and passed to the user
and acknowledged) before. This means that when a TCP connection has a byte
quota of 6000, the connection won't stall until all 6000 bytes of the buffer
are filled. It is possible to write 1 byte at a time to the TCP socket and
without any ack from the other end, send 6000 datagrams ranging in size from
1-6000 bytes long (IP datagrams can be as large as 8kb). TCP doesn't need to
keep around 6000 copies of the datagrams in actual or virtual format to operate.
As I understand the VMS DECnet implementation, the pipeline quota in bytes
is simply used to compute the number of outstanding datagrams that will be
used. Something along the lines of 10000/576 -> 18 datagrams. If a write
is done for 1 byte, then one datagram is used, and it is possible for 18 bytes
of data to consume the 10000 bytes of quota. I'm being extreme, but in the
case of mouse events, I expect that no matter how fast a user is, each click
results in a very small amount of data (25 bytes) being written which will
be sent in a separate DECnet datagram. (As I said, I don't understand this
well, or maybe not at all....)
|
60.66 | Oops, I missed the obvious on TCP | RIGGER::PETTENGILL | mulp | Wed Feb 15 1989 22:32 | 36 |
| I just did a little checking of what messages actually get sent and realized
that I missed the obvious about TCP.
The receiving end of TCP doesn't need to worry about keeping track of record
boundaries, so it can simply stuff everything in one buffer. In the case
of the VMS Connection, it normally has a receive and transmit buffer size
of 4096. After establishing a connection (DECW$CLOCK) and then making sure
that the client would not process any events (^Y) I generated events and
watched how the server system sent about 60 datagrams which filled the 4096
byte buffer on the client system (average of about 70 data bytes each). Then
I watched as the 4096 byte buffer on the server filled. About 30 seconds after
the server buffer filled, the server killed the connection to the client.
Each datagram was ack'd by the client system.
Until both buffers were filled, the server continued to function normally.
In contrast, when I did the same using DECnet/VAX, the server sent about 18
datagrams (each ack'd) averaging about the same size as the TCP datagrams
and then the server stalled. About 30 seconds later the server terminated the
connection. The DECnet system had a pipeline quota of 10000.
This suggests a partial solution; since DECnet won't make efficient use of
its buffers (ie., using 1500 bytes to store 60-70 bytes of data), the DECnet
transport module needs to do it. On the client side, it could do its input
I/O with ASTs and read into a buffer from which it passes data to the Xlib
code. As long as it is able to get ASTs, it will be able to keep the server
happy until its buffer fills. Similarly, on the server side, it needs to make
sure that it never stalls; when DECnet won't accept any more data, the
transport module needs to move the data into its own buffer.
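A sketch of what the server half of that hack might look like; the transport_try_write()
callback, overflow_t structure and all names are invented stand-ins, not the real common
transport code:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char   *buf;                /* bytes the transport would not accept */
        size_t  used, cap;
    } overflow_t;

    /* Called when a transport write comes back "would block". */
    int overflow_stash(overflow_t *o, const char *data, size_t len)
    {
        if (o->used + len > o->cap) {
            size_t newcap = (o->used + len) * 2;
            char *p = realloc(o->buf, newcap);
            if (p == NULL)
                return -1;          /* out of memory: caller must give up */
            o->buf = p;
            o->cap = newcap;
        }
        memcpy(o->buf + o->used, data, len);
        o->used += len;
        return 0;
    }

    /* Called whenever the transport signals that it can accept more data. */
    void overflow_drain(overflow_t *o,
                        size_t (*transport_try_write)(const char *, size_t))
    {
        size_t n = o->used ? transport_try_write(o->buf, o->used) : 0;
        if (n > 0 && n < o->used)
            memmove(o->buf, o->buf + n, o->used - n);
        o->used -= n;
    }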
This is certainly a hack, but I believe that it would only need to be an
interim hack until the Phase V interface becomes available; I'm guessing,
but I suspect that it may help in this area. If not, then this is the kind
of info Tom Harding et al were looking for a few months back when they were
asking what advantage a stream interface offered and should they support one.
|
60.67 | Some events are more equal than others | DSSDEV::TANNENBAUM | | Wed Feb 15 1989 22:45 | 29 |
| Even with the proposed changes, DECwindows would still be missing an
important feature available in the terminal world. Applications
aren't always well behaved. If my application runs away, I want (need)
some way to get control of it without necessarily blowing away the
process. I may have invested a lot of time and effort in my current
application state. I want to save it if at all possible.
Even if a "lost events" event is added, TPU will still need to poll
the input queue periodically to check for ^C's. It's too easy to put a
TPU-based application into an infinite loop. For example, type
TPU a := 0; LOOP a := a + 1; ENDLOOP
at EVE's command prompt and watch TPU count to infinity.
Our first attempt at dealing with this resulted in our asking XLIB
for an AST for any keyboard character. Performance was abysmal.
Users type *lots* of keys at a text editor. Currently we have an AST
that checks the input queue once a second (XLIB can be called at
AST level) and sets a flag if there are any events pending. At
the top of our interpreter loop, we check the flag and call a routine
to dispatch any pending events if it is set. (The tool kit can
only be called from non-AST level)
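For illustration only, here is the shape of that design using a POSIX interval timer in
place of a VMS timer AST; dispatch_one_event() and execute_one_statement() are
hypothetical stand-ins for the event dispatcher and the interpreter's unit of work:

    #include <signal.h>
    #include <unistd.h>
    #include <X11/Xlib.h>

    extern void dispatch_one_event(Display *dpy);   /* hypothetical */
    extern void execute_one_statement(void);        /* hypothetical */

    static volatile sig_atomic_t check_input = 0;

    static void tick(int sig)
    {
        (void)sig;
        check_input = 1;            /* just set a flag; no Xlib work here */
        alarm(1);                   /* re-arm for the next second         */
    }

    void interpreter_loop(Display *dpy)
    {
        signal(SIGALRM, tick);
        alarm(1);
        for (;;) {
            if (check_input) {
                check_input = 0;
                while (XPending(dpy))       /* drain anything queued      */
                    dispatch_one_event(dpy);
            }
            execute_one_statement();        /* one unit of interpreter work */
        }
    }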
Imagine trying to debug an application that goes into an infinite loop
without being able to type ^Y DEBUG...
- Barry
|
60.68 | Events lost event rejected by MIT in the past | STAR::BMATTHEWS | | Thu Feb 16 1989 05:24 | 5 |
| An events lost event was proposed to the X11 developers during X11 development
and it was rejected so I am not sure how likely it is to get this into the
protocol.
Bill
|
60.69 | X12R1 | STAR::BRANDENBERG | Intelligence - just a good party trick? | Thu Feb 16 1989 09:40 | 28 |
|
Getting such a change accepted by the Consortium is going to be a huge
task. This change is more than just adding a new event packet. Here
are some of the implications:
1. Throw out the event section of the protocol manual (which is both a
protocol *and* a server specification). In the future, event delivery
becomes unreliable and so there will be no guarantee the count fields
will be honored or bracketed state changes be undone (such as
buttonRelease after buttonPress, ungrab after grab, etc.). An
"eventsLost" event will require an application to either give up or
recover server state, much more involved than exposure handling.
2. Toolkits, Widgets, and any other programming or environment tool
will have to handle this event gracefully if an application using it is
to be reliable. And not just DEC's toolkit: Athena's, HP's, and every
Tom, Dick, and Harry, Inc. that makes an X Windows System.
3. Rewrite event handling in applications. All applications.
Everybody's applications.
Item 1 is a significant enough change to call this "X12." The
Consortium will have to be pushed *hard* (or off a cliff) to get this
change accepted. After all, how many programmers care about reliable
systems?
m
|
60.70 | | STAR::BRANDENBERG | Intelligence - just a good party trick? | Thu Feb 16 1989 09:44 | 13 |
|
Re .68 "partial solution":
This is part of what common transport attempts to do. There is
obviously a tradeoff between the ability to achieve a stream-like appearance
and the CPU cost of performing the data copies. I chose a point that
leaned too far towards performance and not enough towards streams. I
currently have some transports running that perform more copying on
writes and this has improved reliability. The performance impact is
not yet known.
m
|
60.71 | | KONING::KONING | NI1D @FN42eq | Fri Feb 17 1989 11:19 | 32 |
| Stream transports can make the problem appear less quickly, but clearly can't
eliminate the problem.
Re .71: the impression I get is one I keep getting over and over from certain
areas: that high quality is a non-goal. "Good enough for programmers" is
all that is considered necessary. UGH. I also don't think the arguments
hold water. "Event delivery becomes unreliable." Sure -- but it already
IS unreliable. In all the cases where it is reliable currently, it will
continue to be reliable. In all cases where it is currently so unreliable
that it blows the application completely out of the water, it continues
to be unreliable. The only difference is that the error is no longer a
fatal error but one that applications can, if they wish, recover from.
Currently, the error is fatal and applications are not given the option
to recover no matter how much they may want to.
There is no compatibility problem. Any application that ignores the
event will not be any worse off than it is now. Depending on what it
would have done had it not chosen to ignore the event, it may be very
much better off. Any application whose developers take the trouble to
do some work to process the event is improved in the process.
In other words, you can't lose. It is an absolute improvement for every
application.
Re .65: what to do about applications that don't want to redesign their
code to guarantee that events are processed quickly enough to avoid event
loss: that's where a proper multithread support will help. Put the
application in one thread, the event handler in another, and you're done.
(Well, close, anyway...)
paul
|
60.72 | You can't poll for events often enough, ever | PRNSYS::LOMICKAJ | Jeff Lomicka | Fri Feb 17 1989 13:26 | 15 |
| After what happened to me yesterday, I am convinced that the current X
transports cannot be made reliable on VMS unless you check for X events
between EVERY INSTRUCTION, perhaps more often than that.
You see, yesterday a machine running a client of my workstation went
into a long cluster transition. Need I say more? I will anyway.
I beat on the keyboard and mouse a bit, and sure enough, my entire
STAND-ALONE workstation was hung until the server decided to trash the
offending client, then I could proceed.
My gut reaction to this entire discussion is "how could anybody be so
ignorant as to ignore the flow control problem here".
|
60.73 | | STAR::BRANDENBERG | Intelligence - just a good party trick? | Fri Feb 17 1989 13:45 | 18 |
| re .74:
>My gut reaction to this entire discussion is "how could anybody be so
>ignorant as to ignore the flow control problem here".
Say one of the following in a whining, geekish voice:
1. "It's too haaaaaard to solve."
2. "*I* don't have any problems; there must be something wrong with
the user or programmer!"
3. "What flow control problem?"
4. "Zzzzzzzzzzzz. Snort."
monty
|
60.74 | | VWSENG::KLEINSORGE | Toys 'R' Us | Fri Feb 17 1989 13:58 | 6 |
|
As one of the x11-high-and-mighty started a mail message to me two
years ago: "Any competent programmer..."
|
60.75 | it can be done compatibly | PSW::WINALSKI | Paul S. Winalski | Sat Feb 18 1989 15:19 | 51 |
| Sorry, but the line of reasoning in .71 is faulty. Taking the points in order:
1) Addition of an events lost packet does not mean that event delivery becomes
unreliable. As Paul Koning pointed out, event delivery already IS
unreliable. Events lost continues to be an error condition, as it is today.
The only difference is that an events lost packet lets the client decide
if the condition is severe enough to warrant aborting the connection. Today
the server decides unilaterally that the condition is always fatal.
Applications that cannot deal with the condition for any of the reasons
that you cite are perfectly free to handle an events lost condition by
aborting the connection. The difference here is that those clients that CAN
handle the condition are able to do so.
2) Toolkits *should* handle the condition gracefully to be of maximum service
to the user. Those that choose not to handle the condition gracefully will
offer service that is exactly like it is today.
3) The change can be done compatibly, with no rewrite required in existing
applications. The way to do this is to make enabling events lost
notification optional. This could be done in either of two ways:
o add a routine, call it XSetFlowControl(). This would be analogous to
XSynchronize(). When you enable flow control, the server will try to send
events lost packets instead of aborting the link when buffering capability
is exhausted. If flow control is not enabled explicitly, you get the
current behavior.
o enabling delivery of events lost notifications via XSelectEvent causes
the server to send such events instead of aborting the link when buffer
space is exhausted.
Either of these methods would be completely upward compatible with current
behavior since an application must explicitly ask for events lost packets
to be sent, otherwise you get today's behavior.
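To make the first option concrete, here is a hypothetical client fragment. XSetFlowControl()
and the LostEvents event code below do not exist in Xlib; they are only a sketch of the
proposed usage, and repaint_everything() stands in for application-specific recovery:

    #include <X11/Xlib.h>

    #define LostEvents 35                       /* hypothetical event code */

    extern void XSetFlowControl(Display *dpy, Bool enable);  /* proposed     */
    extern void repaint_everything(Display *dpy);             /* app-specific */

    void event_loop(Display *dpy)
    {
        XSetFlowControl(dpy, True);     /* ask for LostEvents packets
                                           instead of a dropped connection */
        for (;;) {
            XEvent ev;
            XNextEvent(dpy, &ev);
            switch (ev.type) {
            case LostEvents:
                /* We fell behind: treat it as a whole-window exposure and
                 * re-query whatever server state we depend on. */
                repaint_everything(dpy);
                break;
            case Expose:
                /* normal exposure handling ... */
                break;
            default:
                break;
            }
        }
    }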
It should still be a guiding principle of the design of X transports that they
do whatever possible to avoid getting into the situation where either a packet
must be dropped on the floor or the server/client connection aborted. However,
it is a fundamental fact of life in protocols of this sort that data loss due
to buffering capacity being exceeded can and will occur. The trick is to find a
reasonable way to handle the situation. X's current method--terminate the
client link--is Draconian but effective. Allowing an application to receive
an events lost event, if the application so chooses, puts the decision whether
to abort the link in the hands of the client rather than the server. This
seems to me in perfect keeping with the general X philosophy of not having
the server make policy decisions that are better made elsewhere. Who knows
better than the client itself whether the situation is recoverable at the
client end?
--PSW
|
60.76 | devil's advocacy | SSAG::GARDNER | | Sun Feb 19 1989 19:56 | 23 |
| I know absolutely nothing about the X protocol per se, so maybe this is
off the wall. But it's an obvious enough question that it needs to be
asked.
Why can't termination of the connection be treated as an "events lost"
notification? When a connection is broken, why doesn't the application
and/or toolkit try to re-create the connection and, if it succeeds, do
whatever it was going to do to recover from "events lost". When events
are lost, the state of the application's windows, etc. are essentially
indeterminate; it should probably re-construct or refresh them from
scratch (a previous response suggested doing this regardless of any
shortcuts that might be possible). Why can't it just re-create them on
a new connection?
If such an approach is plausible, it has the advantage of being totally
compatible with the current X protocol. Plus it avoids a potential
pitfall of adding "events lost" notifications. Suppose an application
crashes somehow without explicitly tearing down the connection. My
impression is that there's no convenient way for me, from the server,
to abort the connection and recover the server resources that are
devoted to it. The server might merrily preserve the connection
forever, discarding events as necessary.
|
60.77 | | PSW::WINALSKI | Paul S. Winalski | Mon Feb 20 1989 01:17 | 6 |
| You can't just reestablish the connection because all of the windows, graphics
contexts, etc. associated with the connection are destroyed by the server when
the connection goes away.
--PSW
|
60.78 | Why not make it a Extension ? | LESZEK::NEIDECKER | Dont force it,get a bigger hammer | Mon Feb 20 1989 01:57 | 11 |
| Re. 70-71:
If it is so hard to get this additional event accepted by the
consortium, why don't we make it into an extension package that
DECwindows servers support ? If it turns out to be the solution, we
have a bonus; if a server doesn't know the extension, our clients
(the Toolkit, etc.) fall back to whatever they do today (e.g. nothing).
Should be little registration hassle ?
Burkhard Neidecker-Lutz, Project NESTOR
|
60.79 | | SSAG::GARDNER | | Mon Feb 20 1989 12:47 | 11 |
| > You can't just reestablish the connection because all of the windows, graphics
> context, etc. associated with the connection is destroyed by the server when
> the connection goes away.
But doesn't the toolkit/application have a representation of that
information in the various toolkit data structures? Since unknown
events have been lost, you have to walk these data structures anyway to
restore the windows, graphics context, etc. on the screen. To this
(possibly naive) observer, it doesn't seem significantly harder to
re-create the objects first.
|
60.80 | | PSW::WINALSKI | Paul S. Winalski | Mon Feb 20 1989 15:44 | 10 |
| If you lose events, the server still has windows and graphics context for the
application. It's just that they may not be quite in the state that the
application thinks they are in. If you break the connection, the server throws
away the windows and graphics context completely. If the connection goes away
and then the application establishes a new one and restores things, the user
will see the application's windows actually disappear from the screen and then
come back again.
--PSW
|
60.81 | If you can't solve the problem, avoid it ? | STAR::MANN | | Mon Feb 20 1989 20:02 | 17 |
| If the server detects that the user is trying to select a
stalled session, why not just display a skull and crossbones ?
This method:
1 - Gives the user appropriate feedback
2 - Prevents the server from entering a (temporarily) deadlocked state
3 - Prevents the application from being needlessly aborted
4 - Does not involve any X protocol changes
Ever notice the terminal driver lock your keyboard ? Guess
what it would have to do with that character if it let you
type it ?
Or is the X server code unmodifiable in this manner ?
Bruce
|
60.82 | | PSW::WINALSKI | Paul S. Winalski | Mon Feb 20 1989 21:37 | 8 |
| It's more complicated than selecting a stalled session. Suppose you push a
window. That could cause a string of exposure events, some of which can be
sent and others of which can't because the buffer space was exhausted. It's
hard to tell before the operation occurs that it could cause somebody to
overflow buffer space.
--PSW
|
60.83 | complicated, I believe | STAR::BRANDENBERG | Intelligence - just a good party trick? | Tue Feb 21 1989 12:40 | 52 |
|
Sorry, but the line of reasoning in .77 is faulty. Consider the
following extract from page 76 of the X11, Release 2 Protocol Document:
For a given "action" causing exposure events, the set of events
for a given window are guaranteed to be reported contiguously.
If count is zero, then no more Expose events for this window
follow. If count is non-zero, then at least that many more
Expose events for this window follow (and possibly more).
Implications of adding a "LostEvents" event:
1. Protocol and semantics change. Count is more like a "hint" than
a reliable value.
2. Application programs change. Certain coding constructs are no
longer acceptable. For example, an event handling routine may switch
on the eventType in an event packet to execute code such as:
switch (ev.type) {
case Expose:
/*
* dump extra expose events
*/
for (i=0; i<ev.exposeEvent.count; i++) XNextEvent(dpy,&dummyEv);
/*
* do generic exposure handling
*/
do_exposure();
break;
This code is correct under the current protocol and server semantics
but is incorrect after the suggested protocol change is made.
3. Scope of "LostEvents" Event burdens *all* applications. For
reasons of implementability in the server, this event should probably
be associated with a connection and not one-or-more-per-resource as are
many events. (In this respect, it is much like the unmaskable events.)
So, if an application were to turn on this event, any toolkit it used
would also see the event. Or, if any toolkit wanted to receive this
event and turned it on, it would be turning it on for the application.
Is everyone prepared to handle this event even if only to ignore it?
The addition of a "LostEvents" event is necessary. An xlib request
similar to Paul's XSetFlowControl (or a point revision to the protocol
version sensed at connection setup time) may be desirable. But these
two alone are not sufficient to make X reliable. Event processing and
generation *does* change. And, do we have the functionality that allows
an application to recover relatively conveniently from such an event?
monty
|
60.84 | Some additional thoughts... | IO::MCCARTNEY | James T. McCartney III - DTN 381-2244 ZK02-2/N24 | Tue Feb 21 1989 14:10 | 16 |
| RE: .83
Whether the terminal driver locks the keyboard, or simply throws away any
character for which it does not have buffer space causes identical results.
In either case the datastream being sent to the host is interrupted and data
is lost. The keyboard being locked does not physically prevent data from being
transmitted on the line, nor does it stop an operator from continuing to strike
keys. Although I agree with the behaviour of the terminal driver, it is not an
adequate model for solving the problem inherent in X. The terminal driver does
not provide the needed "data lost" indication.
I like the idea of a skull and crossbones, especially if it was imaged inside
of a solid black locator cursor.
James
|
60.85 | Ok, how about this ? | STAR::MANN | | Tue Feb 21 1989 20:51 | 20 |
| >It's more complicated than selecting a stalled session. Suppose you push a
>window. That could cause a string of exposure events, some of which can be
>sent and others of which can't because the buffer space was exhausted. It's
>hard to tell before the operation occurs that it could cause somebody to
>overflow buffer space.
When a session stalls, immediately shrink it to an icon (automatically)
and queue the event/message to it which caused the stall in an "overflow"
buffer(s) (and display a skull and crossbones). Now it cannot become the
recipient of exposure events, can it ? If the session unjams, send the
overflow buffer(s) and resume normal operation.
"buffer space exhausted" is a policy, right ? The workstation has not
run out of memory ! Transport is simply advising it that it is no longer
sensible to send messages because they cannot be delivered just now. Just
reflect this condition back to the user in a way that prevents the user from
continuing that session in a non-discretionary manner (make the user use his
memory).
Bruce
|
60.86 | can the server do that? | AITG::DERAMO | Daniel V. {AITG,ZFC}:: D'Eramo | Tue Feb 21 1989 23:19 | 9 |
| re .87
>> When a session stalls, immediately shrink it to an icon (automatically)
Isn't it the window manager (i.e., another client) and not the
server that knows about things like icons and where they go?
Dan
|
60.87 | Some sleight-of-hand, a little smoke and mirrors.. | POOL::HALLYB | The smart money was on Goliath | Wed Feb 22 1989 09:57 | 7 |
| > Isn't it the window manager (i.e., another client) and not the
> server that knows about things like icons and where they go?
Maybe the server could send a message to the wm saying "put this guy in
the drunk tank". Come to think of it, the icon box icon looks a bit
like a jailhouse window...
|
60.88 | | VWSENG::KLEINSORGE | Toys 'R' Us | Wed Feb 22 1989 10:09 | 64 |
| Let's look at what the terminal driver (VMS), terminal (VT200) and
a random application do...
Terminal data comes in and the terminal class driver puts
the character data into a typeahead buffer and completes
an outstanding read if the conditions of the read are met.
If the typeahead buffer contents reaches a certain degree
of 'full', the class driver tells the terminal to shut-up
(XOFF). It will still accept data until the typeahead is
full at which point it drops any further data and returns
a DATAOVERRUN when the typeahead is finally read.
When the terminal gets the XOFF, it *also* (at least on VT200's
and VT300's) has some amount of buffering and will buffer
transmit data until *its* buffer is full at which time it
sets the WAIT LED and drops transmit data.
The application is oblivious to all this. Periodically it
reads the typeahead buffer and only knows about any of this
when and if it gets a DATAOVERRUN message.
Extend this to the X11 world:
First, this implies that the client software which manages the
connection gets asynchronous notification of an event and moves
the packet from the transport to the client's event queue (i.e.
this operation is not a side effect of a processing loop in
user mode!). This software sends a message to the server when
the client's event queue reaches some degree of 'full' telling
the server to shut-up (XOFF). It of course still accepts new
events until it runs out of free packets at which time it
starts dropping events and begins to build an event-lost client
structure.
The server sends event packets off as long as it hasn't been
told to 'shut-up'. By a combination of local buffering on a
per-connection basis by the server and the 'slop' in the client
side event queue after an XOFF, the event "counts" should always
remain valid; that is, even if the server is XOFFed after having
sent the packet with the "count", the combination of the client
buffering to soak up packets already "in the pipe" and the server
buffering of unsent packets would deliver all the packets promised
(so it doesn't need to change the meaning of the count to a
'hint'). If and when the server runs out of buffering, it starts
building a server-side event-lost structure that will be used
to build an event-lost event when the client starts taking input
again. This implies that the server is smart enough not to
send a counted event during an XOFF if there is not enough local
buffering available for all the event packets.
When the server reaches the point that it is discarding events,
it *can* do some visual 'hints' that it is stalled including
ringing the BELL for KB input, turning off autorepeat on the
KB, setting the WAIT LED on the KB and changing the cursor shape.
All of these can be restored to a proper state once
the error state is corrected.
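A rough sketch of that client-side XOFF/XON bookkeeping, with invented names and
thresholds; send_xoff_to_server() and send_xon_to_server() are hypothetical control
messages, and real code would live in the connection-management (AST) layer:

    #define QUEUE_CAPACITY  512
    #define HIGH_WATER      (QUEUE_CAPACITY * 3 / 4)
    #define LOW_WATER       (QUEUE_CAPACITY / 4)

    typedef struct {
        int depth;            /* events currently queued for the client */
        int xoff_sent;        /* we have told the server to be quiet    */
        int lost_count;       /* events dropped while completely full   */
    } client_conn_t;

    extern void send_xoff_to_server(client_conn_t *c);  /* hypothetical */
    extern void send_xon_to_server(client_conn_t *c);   /* hypothetical */

    /* Called (e.g. from an AST) whenever a packet arrives from the server. */
    void on_event_arrival(client_conn_t *c)
    {
        if (c->depth >= QUEUE_CAPACITY) {
            c->lost_count++;          /* no room at all: drop and remember */
            return;
        }
        c->depth++;                   /* enqueue the event (not shown)     */
        if (!c->xoff_sent && c->depth >= HIGH_WATER) {
            send_xoff_to_server(c);   /* please stop sending               */
            c->xoff_sent = 1;
        }
    }

    /* Called whenever the application consumes an event from the queue. */
    void on_event_consumed(client_conn_t *c)
    {
        c->depth--;
        if (c->xoff_sent && c->depth <= LOW_WATER) {
            send_xon_to_server(c);    /* resume delivery                   */
            c->xoff_sent = 0;
        }
    }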
Now, all of this is probably meaningless, because I've got this
nasty feeling that the client-side input queue is built as part
of the polling loop, and otherwise the data just stacks up
uncollected in the transport buffers (DECnet, whatever).
|
60.89 | I ruminate? | VINO::WITHROW | Robert Withrow | Wed Feb 22 1989 13:31 | 57 |
I'm not an Xwindows maven, so don't yell at me. I'd like to categorize the
later portions of this note (which seems to have migrated somewhat from
the original topic). I will only be speaking in ``broad conceptual'' terms.
It seems to me that there are two concerns: 1) What should happen when a client
is sourcing events faster than the server is sinking them, and 2) what should
happen when the server is sourcing events faster than a client is sinking
them.
In case (1) it seems that most participants think it is fine if the
client is forced into quiescence (forced to nap) until the server has
caught up with it. Seems reasonable to me. Nothing is lost and other
clients are not affected.
In case (2), one can not force the server to nap because that will affect
other clients. A previous reply suggested a skull and crossbones cursor,
etc. Other objected that information is getting lost. Comparisons with
terminal handlers were made, etc... Can we take this in parts?
a) Does everyone agree that (2) is a ``policy'' issue? I mean, it's nice
to claim that a client should always be able to sink events at least
as fast as the server can ever source them, but I don't think that is
possible since one can never have infinite buffering. Lacking infinite
buffering one must have flow control, and that seems to me to mean
``flow control policy''.
b) It seems that flow can be controlled in several places and in many
different ways. Suggestions have been: Implement flow control in
X protocol, possibly by throwing excess events into the can and telling
the client we did that; Implement flow control in a lower layer, and if
(2) happens take drastic action (which seems to be what is done now);
Implement flow control in the server by refusing to send events until
the client catches up. Are there more?
Since (I hope we agree) this is a policy issue, I guess I would like to
see it resolved in the server, since I feel that it is rude of the server
to bombard the client with events, and, in the interest of robustness,
I would prefer to assume that the server is smarter than the typical
client (and thus should be able to restrain itself). Also, it is a
single point solution that does not require every
single client to worry about what to do with a rude server (servant?).
To that end, it seems reasonable to handle (2) this way: When the
server discovers that it is sourcing events faster than a client is
sinking them it should: a) Ignore all user input into the window(s) associated
with the client (Perhaps it should beep for keypad input, and should
turn off the mouse pointer when it enters the window), and b) not send
exposure events to the client. If the server does save-unders it would
be free to repaint exposed areas itself from its backing store, otherwise
it should just leave the ugly holes in the window.
Later, when the client catches up, the server should again allow user input
in the windows, and (if it wanted to send any exposure events but couldn't)
send an exposure event for the entire window.
Like I said, don't yell at me!!!!! ;-)
|
60.90 | I rusticate | STAR::BRANDENBERG | Intelligence - just a good party trick? | Wed Feb 22 1989 15:13 | 177 |
|
Re .91: I'll talk some more...
I agree that this is *at least* a matter of policy but may also be a
matter of protocol and server specification. (My earlier reference to
the interpretation and generation of expose events is sufficient to
make the latter true.)
I also accept the policy on case (1) where the client can't send to the
server.
Now, as for case (2), you've summarized the possibilities as being:
a. Drop events and generate a "LostEvents" event when possible.
b. Drop the connection.
c. Drop events but don't give any indication to the application
(there may be user/device feedback, however).
If reliability, at least as I understand the term, is a goal, then b.
is clearly unacceptable. If either a. or c. is chosen, the protocol
and server specifications still must change (see previous discussion on
the interpretation and generation of expose events). Furthermore, I
believe that c. is *extremely* unfriendly to the application. It
doesn't find out that it has lost events until it either receives
information on the server state that is inconsistent with its model of
the server state or the user tells the application, via a "fix-up"
request, that it is confused. Consider a window manager in
window-resize mode: it has grabbed the server, it's receiving mouse
motion events to perform stretchy-box operations but the server drops
some number of mouse motion events *and* the upclick of the mouse. At
what point does the window manager find out that information was lost
so it can ungrab the server and return to a safe state?
Now, I'll jump into a policy definition for all data sent from the
server to the client. Keep in mind the following things:
1. Any client can send any event to another client with
XSendEvent().
2. Clients interact with other clients as a natural part of
operation. One client's requests may result in any number of
events being generated for any or all of the other clients.
3. Extensions. Always remember extensions. We don't know
what they'll look like or how they'll define their own
events, if they do at all, and use whatever policy we
establish.
4. Certain events/state transitions currently guarantee that
certain other events will be sent at a later time. If event
delivery becomes unreliable without disconnecting a link,
these "guaranteed" events may not be received by the client.
The following "#define"'s are taken from x.h. They represent the
*currently* defined event codes. I've also included replies (type
code '1') and errors (type code '0').
#define X_Reply 1
Reply to request issued by client. Unlike events and errors which are
always 32 bytes, this may range in size from 32 to 2^34 bytes. There
is some indication that the client will try to read data but should the
server wait unconditionally for a slow or hung or thrashing or
malicious client? I suggest a configurable parameter specifying a
timeout for reply operations, probably on the order of 5-10 seconds.
If the client doesn't respond, disconnect.
#define X_Error 0
Some request generated an error. Errors generated by asynchronous
requests are asynchronous, those generated by synchronous requests
(i.e. those expecting replies) are synchronous and the event is sent in
place of the reply. In the latter case, errors should be treated as
replies and the timeout should be used. In the former case, they could
be treated as either replies or as events (they may be dropped).
#define KeyPress 2
#define KeyRelease 3
#define ButtonPress 4
#define ButtonRelease 5
#define MotionNotify 6
Indicates that a keyboard key was pressed or released, a mouse button
was pressed or released or that an "interesting" motion of the mouse
occurred. With unreliable delivery, release and press events may not
match up. If the Button?Motion masks had been used in requesting mouse
motion events, a stream of mouse motion data may suddenly stop without
any indication that a button had been released. Etc.
#define EnterNotify 7
#define LeaveNotify 8
Indication of mouse travel through the window hierarchy. With
unreliable delivery, any part of the traversal may be dropped so that
there will be no indication that the mouse passed out of, into, or
through some number of windows. This may confuse some applications.
#define FocusIn 9
#define FocusOut 10
Indication of change of input focus to some windows. Also traverses
hierarchy much like enter/leave notify so same caveats apply.
#define KeymapNotify 11
Report of state of keymap. Currently, when requested, it is sent after
every enternotify and focusin event and a client can rely on this.
With unreliable delivery, this event may be lost or the preceding
focusin and enternotify may be lost thus creating an unexpected event.
#define Expose 12
#define GraphicsExpose 13
#define NoExpose 14
Previously discussed. Has a "reliable" count down field for contiguous
events. This no longer works with unreliable delivery.
#define VisibilityNotify 15
Sent to a client after hierarchy change operations. If lost, client
may not know that a part of the display is now visible.
#define CreateNotify 16
#define DestroyNotify 17
#define UnmapNotify 18
#define MapNotify 19
#define MapRequest 20
#define ReparentNotify 21
#define ConfigureNotify 22
#define ConfigureRequest 23
#define GravityNotify 24
#define ResizeRequest 25
#define CirculateNotify 26
#define CirculateRequest 27
#define PropertyNotify 28
*IMPORTANT* See the protocol specification. Used by window managers to
intercept application requests for hierarchy changes, etc. If these
are lost, the window manager will *REALLY* be confused. How are these
recovered?
#define SelectionClear 29
#define SelectionRequest 30
#define SelectionNotify 31
Selection events. Loss of these may mean that several clients think
that they own a selection or other problems.
#define ColormapNotify 32
Notification that a colormap has been changed. Window managers and
clients are interested in this. Loss of this event *will* prevent
colormap install oscillations. hahahahaha.
#define ClientMessage 33
Generic information from one client to another. Also used to "wakeup"
toolkit from AST level. Since this information cannot be recovered by
a request, who should receive an error if this can't be sent? The
recipient or the sender? Should this be made to execute like replies?
#define MappingNotify 34
Report that a modifier, keyboard, or pointer mapping request was
executed. Loss of event means that a client may use the wrong mapping
when it again receives input events.
There is more to flow control than just dropping data and repainting
windows later. THIS IS A BIG PROBLEM.
monty
|
60.91 | | PSW::WINALSKI | Paul S. Winalski | Thu Feb 23 1989 16:53 | 22 |
| RE: .92
I agree, it's a big problem. It's far too big a problem for the server to
arbitrarily decide for a client whether or not the situation is recoverable.
If a client receives a LostEvents event, it knows which events it had enabled
reporting for, and therefore what recovery actions have to be taken (if indeed
any are possible).
Receipt of a LostEvents event is an error condition. Any client is well within
its rights to treat receipt of this event as unrecoverable and abort the link.
For example, the window manager probably would abort upon receipt of
LostEvents, since the event that was lost might be CreateNotify, DestroyNotify,
or one of the other events that you cited. On the other hand, I have written
several applications that listen only to a small number of events and don't
really care if they miss one or more of them--if they are told that events were
lost, they can query the server as to the present situation, or for some events
(exposure, for example) they can assume the worst case and do recovery. Why
should these sorts of applications get terminated unconditionally by the
server?
--PSW
|
60.92 | Just say WAIT | CVG::PETTENGILL | mulp | Thu Feb 23 1989 20:53 | 28 |
| One solution would be to have the server clear the screen and display a big
WAIT whenever it became blocked trying to send to a client. However, that
might lead to a deadlock, or at least a situation where the user must wait a
long time for things to free up, so the server would need to watch for multiple
^Y's so that it could ask `Are you pounding on ^Y to abort the client?'
Seriously, I'm mostly kidding above. But now I'm not.
No scheme can prevent data arriving faster than it can be sent out and with a
user involved, you can't `flow control' a user so you are always going to be
faced with the possibility of data overrun. Therefore, it will be necessary
to discard data one way or another and somehow notify the only thing that can
deal with the problem in an intelligent fashion, the user. Currently this is
done by waiting for a while and then killing the connection and discarding
all the related data (and probably discarding some or all the data that the
user supplied while waiting) and then when the application clean up is done
the user is notified by the absence of his application and possibly gets an
error message.
The proposal to send a `lost event' event is a compatible extension. If the
application can't deal with the problem at all, or only sometimes, then the behavior
is no different than today. If, on the other hand, it can recover, then it is
a big improvement. Note that the WM can recover totally, although the user
might notice the recovery. If you don't believe me, just stop the process and
then run it again. Everything will return to the way it was. Maybe it's not
the best that one could ask for, but it is better than not allowing the user to
continue at all.
|
60.93 | Window manager can't really recover... | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Fri Feb 24 1989 12:28 | 15 |
Just a nit about .94: The window manager can't recover completely from the
situation that PSW mentioned: losing a MapRequest. In this case, the
client which issued the Map will just sit around forever thinking that it
got mapped, but not really being mapped. When the window manager makes its
"recovery" scan to figure out which windows to work on, it will never deal with
the "hanging" window, because it will assume that the client has not requested
that it be mapped yet.
However, having the window manager abort in this case does not help either.
This is a good example of the dilemmas faced when you try to break the
"reliable byte stream" assumption, though.
Burns
|
60.94 | | PSW::WINALSKI | Paul S. Winalski | Fri Feb 24 1989 14:07 | 11 |
| The point is that we DON'T have a reliable byte stream today. Should the window
manager or any other client get behind in processing events for any of a number
of reasons, the server will abort the connection and discard the queued
events. The only thing that a LostEvents event does is let the client decide
whether to abort the connection instead of the server. If the LostEvents
feature is left disabled by default and enabled by an explicit XSetFlowControl()
type call from the client, then the change is completely upward compatible.
--PSW
|
60.95 | | KONING::KONING | NI1D @FN42eq | Mon Feb 27 1989 15:44 | 26 |
| Not only do you not have a reliable bytestream now, you never did, and you
never will. Incidentally, the comment in .74 about "...transports...cannot
be made reliable on VMS" misses a key point: this whole discussion has
NOTHING to do with VMS, it has to do with fundamental and, I would have
thought, well known properties of distributed systems no matter what OS
they are built on.
I can see from the analysis in .92 that, for some applications, recovery
from EventsLost is harder than just repainting the windows. For many it
will not be, though. And clearly every one of them always has the option
to declare these to be fatal errors, in which case the situation is no worse
than it is now -- other than being that way by design rather than by
omission.
Something to consider: currently applications die when this problem occurs
because the connection terminates (and they don't do Ed Gardner's "devil's
recovery"). If one were to change X by simply adding this event, without
adding the enabling stuff that PSW proposed, then many applications would
still abort since they don't recognize the event. Is that incompatible?
Then again, I probably can't get away with bitching about the X definition
of "reliable transport" and at the same time proposing this definition of
"compatible". :-)
paul
|
60.96 | More Fat for the Fire | STAR::BRANDENBERG | Intelligence - just a good party trick? | Mon Feb 27 1989 16:22 | 32 |
|
Here are some things to ponder (not directed at any reply, in
particular).
o If a "reliable server" is a requirement for some users in some
applications then do we need to provide a mechanism by which this user
can enforce this policy? The policy might state that a client either
accepts an eventsLost event or takes a quick disconnect on transport
jam. Or, it might state that *only* eventsLost clients are accepted.
If the latter, then sensing the type of client should happen at an
early stage of a connection, say, when the client transmits its
protocol level.
o I've argued that eventsLost processing is basically connection-wide
and not associated per resource. Hence, one part of an application
will unilaterally decide for the entire application what mode it will
run in. This isn't a problem for code that is newly written or that
will be reworked for a new version of decwindows. But we already have
a V1.0 product and, I assume, some sort of support commitment. If so,
it would be incorrect for a new toolkit library to enable lostEvents
for an old application or for a new, reliable application to rely on an
old toolkit library. How does the left hand keep up with what the
right hand is doing?
o The default event processing is not upward compatible with the new
lostEvents event. Most event processing will simply consume any
unrecognized events because there already exist events which cannot be
masked by XSelectInput so an application must expect the unexpected (to
some degree).
monty
|
60.97 | | KONING::KONING | NI1D @FN42eq | Mon Feb 27 1989 17:06 | 29 |
| I might want to have a policy that all the applications I use must support
EventsLost. But I don't see a way to enforce that in the way you
describe; that merely says that the application does something, but what
it does isn't necessarily sane. Essentially this requirement is one of
those of the form "The application must have high quality". This sort of
requirement has the Felix Frankfurter property "I know it when I see it".
My conclusion:
a. EventsLost should be added in the next version of DECwindows.
b. Handling of that event should be added in the next (or current, depending
on planned release date) version of every DEC product, and in particular to
all widgets.
c. Enabling of the sending of EventsLost (per PSW) is the job of the main
program (via an Xmumble or XtMumble call). Widgets don't do this.
Our applications do, of course, as soon as they have been fixed to handle
the event.
My guess would be that most of the work is in the widgets; the changes to
support the new event would be minor for most applications (though not for
all, obviously). So by doing the 3 steps I mention, we create the message:
1. Our applications are now more robust than before (and indeed more so than
any others in the industry).
2. Your applications can be, too, with -- usually -- a small amount of effort.
Just use the new widget library, which, of course, is upward compatible,
work out the recovery you need, and issue the enabling call.
paul
|
60.98 | Calvin & Hobbes Engineering Inc. | STAR::BRANDENBERG | Intelligence - just a good party trick? | Tue Mar 07 1989 10:36 | 24 |
|
I suppose my perception of a need for enforcing a policy comes from
notes in other conferences with references to "mission critical" X
applications. Specifically, both NASA and ESA are looking at using X
in their manned space programs with ESA actually using it for spacecraft
instrumentation. Since Joe Astronaut probably isn't going to start an
Xtrek session during a flight, perhaps I'm overreacting, but it was the
possibility of such applications, and a rather cavalier attitude about
what constitutes sufficient testing in a system exhibiting stochastic
behaviour that prompted my original tirade around reply .28.
There was a Calvin & Hobbes cartoon some years ago that went like this:
Calvin: Dad, how do they get the load limit for bridges?
Dad: Well, Calvin, they drive bigger and bigger trucks over it until
it breaks. Then they rebuild the bridge and weigh the
last truck.
Calvin: Oh! I should have guessed that!
Unfortunately, this is exactly how software load limits are determined
today.
monty
|
60.99 | Is anyone actually going to do anything? | WINERY::ROSE | | Tue Mar 14 1989 12:37 | 7 |
| This is an interesting discussion. Sorry I missed most of it while on
vacation... But is anyone taking an action item to try and get
EventsLost added to the X protocol (for example, in the ANSI
standardization process)?
Re .97: By Felix Frankfurter I think you mean Potter Stewart.
|
60.100 | let the server do it! | NEXUS::B_WACKER | | Fri Mar 24 1989 10:49 | 53 |
| Since xlostevent is so fraught with problems and so unlikely to make
it past MIT, how about another approach?  Use the terminal driver model
(.88) so the session manager knows of the problem before it is too
late.  Create a modal message in the offending process's window that
says something like "This process is a hog and you can either wait
till it's eaten its fill or push the kill button (in the box) if you
want to get rid of it."  Do all the previously suggested beeping,
keyboard freezing, skull and crossbones, etc., to tell the user that
no other input will be accepted for this window other than the kill
button.
What about all the other windows?
In a first version you could just stall them, too.  Send a message telling
the clients to block until they get a message that the hog is satisfied or
consciously killed by the user.
In a second version, you could make the server smart enough to watch for
actions that generate messages to the hog.  If that happens, then stall
the initiator and give it a box that says "waiting for the hog, kill
or wait."  That way, if there's no cross-process communication going on,
the hog is the only one to suffer.  You can still have real-time
graphics output to the thermometer for your nuclear coolant!  You
could still move a window that is partially obscured by the hog out
from under it, and do virtually everything where there's no geometry
interaction with the hog.
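In rough server pseudocode, the second version amounts to no more than a
flag per client and one check at event-delivery time (this is a sketch of
the idea, not real server source):

    typedef struct Client {
        int is_hog;        /* this client's output queue is jammed      */
        int is_stalled;    /* waiting on a hog; input currently refused */
        /* ... transport handle, pending output, and so on ...          */
    } Client;

    /* Called when an action by 'initiator' would queue an event for 'target'. */
    static void deliver_or_stall(Client *initiator, Client *target)
    {
        if (!target->is_hog) {
            /* normal case: queue the event for the target as usual */
            return;
        }
        if (initiator != target && !initiator->is_stalled) {
            initiator->is_stalled = 1;
            /* put up the "waiting for the hog, kill or wait" box and
               refuse further input for this client's windows */
        }
    }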
Advantages I see:
1) The only new protocol is the stall to the client (which may already
be there for sync??)
2) The user (not the application) is in control of whether or not the
connection is terminated.
3) All the implementation is in the server, so upward compatibility is
guaranteed.
4) A session manager option could enable this functionality or
the current "tough s__t" approach.
5) It could be a very important differentiation feature between our
server and other vendors' if MIT drags their heels.
6) You completely avoid the impossible problem of how to recover from
lost events because you don't lose any. (There really is no general
solution to this problem, is there?)
7) The user is in control.
8) The user is in control.
Bruce
|
60.101 | | KONING::KONING | NI1D @FN42eq | Fri Mar 24 1989 11:17 | 15 |
| I don't see how that can work. Some events are indeed caused by user input,
and you could perhaps block those. But a lot of other events come from the
actions of other clients -- for example, expose events occur if another
window is moved, resized, deleted, or iconized. You can't block those
operations because then you would affect other clients. (If you think it's
ok to affect other clients, you might as well just halt the system when this
problem occurs.) If you can't prevent other clients from doing the things
that generate these events, then the only alternative, given that you
have no place to store the events, is to discard them and let the affected
client know that this happened.
What's the big deal?  This is elementary stuff in distributed systems design.
paul
|
60.102 | | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Fri Mar 24 1989 13:12 | 11 |
| Personally, I think the first thing to do is to reduce (but not eliminate)
the problem by giving the server the capability of saving the event away
and trying again later while continuing to process requests.  Obviously this
can't go on forever: the server runs out of memory if a client sits
around idle long enough.  However, it goes a long way toward alleviating
the short-term problem.  In the long term, I think you have to tell a client
that it has lost something.  However, we want to minimize the frequency with
which we have to do this.
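The shape I have in mind is roughly the sketch below -- a per-client queue
of events we could not write, with a hard cap.  It is an illustration, not
server source; the 32-byte size is just the fixed X wire-event size.

    #include <stdlib.h>
    #include <string.h>

    #define MAX_DEFERRED 4096          /* at some point we still have to give up */

    typedef struct Deferred {
        struct Deferred *next;
        char             wire[32];     /* one X event as it appears on the wire */
    } Deferred;

    typedef struct Client {
        Deferred *head, *tail;         /* events we could not write yet */
        int       ndeferred;
    } Client;

    /* Called when the transport to 'c' is full.  Returns 0 if the event was
       buffered, -1 if we have finally run out of room (the event is lost). */
    static int defer_event(Client *c, const char ev[32])
    {
        Deferred *d;

        if (c->ndeferred >= MAX_DEFERRED || (d = malloc(sizeof *d)) == NULL)
            return -1;

        memcpy(d->wire, ev, sizeof d->wire);
        d->next = NULL;
        if (c->tail != NULL)
            c->tail->next = d;
        else
            c->head = d;
        c->tail = d;
        c->ndeferred++;
        return 0;
    }

    /* When the transport drains, the queue is retried in order and freed. */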
Burns
|
60.103 | let the user decide | NEXUS::B_WACKER | | Fri Mar 24 1989 15:33 | 20 |
| >(If you think it's ok to affect other clients, you might as well just
>halt the system when this problem occurs.) If you can't prevent other
>clients from doing the things that generate these events, then the
>only alternative, given that you have no place to store the events, is
>to discard them and let the affected client know that this happened.
You only affect the clients that are muddying the waters of the one
who's run out of resources.  True, that could escalate, but the USER
could still abort if it is the "wrong thing".  A bad apple in the
barrel affects everyone sooner or later.
>What's the big deal?  This is elementary stuff in distributed systems
>design.
Doesn't that imply a design where the server is capable of restoring
all the lost context or where the client has a copy of the server
database, neither of which obtains here?
Bruce
|
60.104 | | KONING::KONING | NI1D @FN42eq | Mon Mar 27 1989 13:18 | 10 |
| Resource issues caused by the fact that process P is not running as fast as
it needs to should be confined to process P, and should not affect other
processes. That's what I was pushing for. The fact that the other processes
are doing things for the same user is irrelevant.
re .102: I agree, reducing the incidence of the problem is a good first step
while we wait for the real solution, if and when it actually comes to pass.
paul
|
60.105 | Why are we still arguing this? | IO::MCCARTNEY | James T. McCartney III - DTN 381-2244 ZK02-2/N24 | Mon Mar 27 1989 20:09 | 14 |
|
>>> A bad apple in the barrel affects everyone sooner or later.
If my application that went off compute-bound was some critical life-support
or fail-safe control mechanism, I'd sure hate for my display server to decide
that it "wasn't playing by the rules" and disconnect it.  The point is really
simple: you can't stop all the different sources from which events can be
generated, you can only hope to catch them all.  When you can't, you must do
something reasonable.  The lost-events event is the classical way to handle this
type of flow-control problem.  It's not perfect, but at least the failure modes
are such that you can recover.
James
|
60.106 | | CVG::PETTENGILL | mulp | Tue Mar 28 1989 15:45 | 6 |
| Here's a variation on the flow control problem. Try xlsfonts with the full
info option and watch as the server `hangs'. As the `man page' for it says
under `bugs', this is a problem with the single-threaded server design.
ELKTRA::DW_EXAMPLES note 116
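What the full-info option boils down to is one call like the sketch below
(this is not the actual xlsfonts source, just the standard Xlib calls
involved): the server grinds through its whole font path to build one
enormous reply and, being single-threaded, serves nobody else until it
is done.

    #include <stdio.h>
    #include <X11/Xlib.h>

    int main(void)
    {
        Display     *dpy = XOpenDisplay(NULL);
        XFontStruct *info;
        char       **names;
        int          i, count;

        if (dpy == NULL)
            return 1;

        /* One request asking for full information on every font. */
        names = XListFontsWithInfo(dpy, "*", 100000, &count, &info);

        if (names != NULL) {
            for (i = 0; i < count; i++)
                printf("%s\n", names[i]);
            XFreeFontInfo(names, info, count);
        }

        XCloseDisplay(dpy);
        return 0;
    }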
|
60.107 | | ULTRA::WRAY | John Wray, Secure Systems Development | Wed Jul 26 1989 15:59 | 3 |
| Any news on this issue? Are the MIT people looking at it, or have they
defined it to be a non-problem?
|
60.108 | | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Thu Jul 27 1989 13:24 | 11 |
| I talked to Bob Scheifler about it. He believes it is a non-problem. Monty
Brandenberg was going to make a proposal for fixing it. However, he decided
to leave the company and take up consulting before he could get to it.
Version 2 of DECwindows relieves the problem to a large extent by doing
additional buffering with DECnet. As has been discussed before, this does
not truly solve the problem, but it does reduce the cases where we see it.
In fact, I have not seen it at all since this happened.
Burns
|
60.109 | | ULTRA::WRAY | John Wray, Secure Systems Development | Sat Feb 03 1990 16:05 | 18 |
| I don't understand how he can view it as a non-problem. Without
application-level flow-control (and lost-event handling) of some sort,
I can write a non-privileged application which can cause other
applications sharing the same display server to crash. Bugs in one
application can cause other applications to crash. Glitches on the
network which tear down the transport connection can cause applications
to crash. I've just demonstrated that a user with a quick mouse finger
can kill random applications (although it is true that this is more
difficult now than it was under VMS DECwindows V1).
All this boils down to "X, as defined at present, is inherently
unreliable", which seems to mean that it is unsuitable for most
process-control applications.
Or am I missing something?
Is there any record of Monty's proposed fix? Is it being followed up
by anyone else within Digital?
|
60.110 | A voice from the past coming back to haunt? | DECWIN::FISHER | Burns Fisher 381-1466, ZKO3-4/W23 | Mon Feb 05 1990 13:01 | 5 |
| He never wrote anything down. No, it is not being pursued. It is very hard
to pursue a theoretical problem when there are millions of problems that
customers see (and complain about) every day which are waiting to be solved.
I agree...it is not fixed or solved. However...
|