T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
655.1 | | KONING::KONING | Paul Koning, A-13683 | Thu Jul 23 1992 19:06 | 4 |
| How low is low? There are a lot of different answers that make sense depending
on whether you mean 1 microsecond, 1 millisecond, or 100 milliseconds.
paul
|
655.2 | | MSBCS::KALKUNTE | Ram Kalkunte 293-5139 | Thu Jul 23 1992 21:28 | 35 |
| As in .1, I would like to know what target latency your customer
has in mind. But generally ....
>>If one has two stations on an FDDI ring (probably should be any LAN),
>>and you want a user process to send a low-latency signal to another user
>>process on the other system, what would be the best protocol (standard
>>or non-standard)?
You can obviously get better performance with customized, light-weight
protocols (I am assuming this is what you mean by non-standard).
>>And, if you did this, could you also send standard DECnet and TCP/IP
>>packets across the network using the same adaptor? The real target
>>would be Alpha/OpenOSF, but Alpha/OpenVMS would also be useful, and
>>any system/OS could be used for development.
Definitely possible.
>>The customer may be willing to write his own device driver to make this
>>happen.
It's not the device driver that he should plan to write, it is the
application (with comm protocol). There is not much fat that you can
remove by writing your own device driver.
>>The goal is to minimize the number of instructions used by
>>both the sending and receiving nodes, and therefore minimize the latency.
>>The limitation is that there may be a moderate number of processes on
>>a node (less than 100) which want to send or receive using this mechanism.
This cannot be answered without a complete set of requirements for this
application. It may not be a limitation if the application is designed
correctly.
Ram
|
655.3 | 10-30 microsecond CPU overhead | LEMAN::MBROWN | | Fri Jul 24 1992 08:35 | 37 |
| Sorry, I should have been more specific.
What is desired is system-induced latency (as opposed to transmission latency)
of about 10 microseconds for the combined send and receive overhead.
This would be on an Alpha desktop system, so a DS5000-240 should be about
30 microseconds.
The reason for the low number is to use the spare workstation cycles as a
low-cost MPP during the evening hours. I have talked to the MPSG group,
but their efforts are not directly relevant, at least not now.
IBM is currently pushing RS6000's with PVM (Parallel Virtual Machine)
software from ORNL plus Ultranet as an interconnect. Many applications
will not work using standard PVM over TCP/IP because the communication
latency for signals and small data packets is too long (multi-milliseconds).
What I am looking for is the 1) lightest-weight, 2) closest-to-standard,
3) easiest-to-implement [or better yet, already implemented or prototyped]
protocol available. Time to customer is more important than
going from 35 to 30 microseconds.
> This cannot be answered without a complete set of requirements for this
> application. It may not be a limitation if the application is designed
> correctly.
The problem is that it isn't a single application, but the use of a series
of systems to run different applications distributed over the whole group.
Several of these applications (I said 100 before, but probably usually 10)
may be running at a time. As mentioned in -.1, I made a mistake in that it
probably isn't a driver that needs to be written, but a simple high-level
interface with a little bit of logic to do a few-to-few mapping.
Anyway, any suggestions of things on the shelf would be useful.
Thanks,
Michael
|
655.4 | Re-examine your application! | KONING::KONING | Paul Koning, A-13683 | Mon Jul 27 1992 15:52 | 26 |
| I don't know what to do with "system induced latency (as opposed to
transmission latency)". Total latency is a meaningful metric, which can
then be split up into latency contributions, all of which depend on the
choice of technology.
What do you think the network latency will be? I suspect you're counting
on network latencies on the order of 10 microseconds also. (If you're looking
at significantly larger numbers then the 10 µs for the rest of the system
makes no sense.)
A latency guarantee of 10 µs is nearly impossible for FDDI (and, I suspect,
for anything else). The station delay is 1 µs per station; add to that the
cable delay and you can see that you have consumed your whole budget already
unless the network is VERY small. Now you have to add the time waiting
for a usable token. For async (normal) transmission mode, the worst case
delay is (N-1)*TTRT; for sync transmission mode, it's 2*TTRT. TTRT is a
network parameter often set to 8 ms; its lower limit is 4 ms.
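To make the arithmetic concrete, here is a quick back-of-the-envelope sketch
(plain C; the ring size and cable delay are assumptions, not measurements) of
the worst-case numbers described above:

/* Back-of-the-envelope FDDI latency budget, using the numbers quoted
 * above (1 us station delay, worst-case token wait of (N-1)*TTRT for
 * async and 2*TTRT for sync).  All values are illustrative only. */
#include <stdio.h>

int main(void)
{
    int    n_stations = 10;       /* assumed ring size              */
    double ttrt_ms    = 8.0;      /* typical TTRT setting (ms)      */
    double station_us = 1.0;      /* per-station delay (us)         */
    double cable_us   = 5.0;      /* assumed total cable delay (us) */

    double ring_us  = n_stations * station_us + cable_us;
    double async_ms = (n_stations - 1) * ttrt_ms;   /* worst case, async */
    double sync_ms  = 2.0 * ttrt_ms;                /* worst case, sync  */

    printf("ring transit            : %6.1f us\n", ring_us);
    printf("worst-case token (async): %6.1f ms\n", async_ms);
    printf("worst-case token (sync) : %6.1f ms\n", sync_ms);
    return 0;
}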
You might have better luck using Ethernet, actually... since then you don't
have to wait for the token.
Could you explain what this application is doing that requires such a low
latency? I've never seen numbers anywhere close to this low. What you're
looking for is, at BEST, pushing the state of the art.
paul
|
655.5 | Simulating nuclear collisions | BONNET::LISSDANIELS | | Tue Jul 28 1992 08:38 | 22 |
| Paul,
| I believe they are gearing up to simulate what happens in a nuclear
collision, like an experiment in the CERN collider. HEP stands for
High Energy Physics... They may e.g. want to track the paths of the resulting
particles...
If they throw enough ALPHA workstations at the problem it should
be a cinch ;-)
As for the network - maybe this is THE job for GIGAswitch???
In full duplex mode you would not have to wait for a token;
the GIGAswitch is the only "station" between sender and receiver.
So the distance would then be the only variable for the network
delay - provided the traffic is well spread between the participating
CPUs...
So that brings us back to the initial question -
any good, reliable, but lightweight protocols out there?
Comments anyone ?
|
655.6 | | KONING::KONING | Paul Koning, A-13683 | Tue Jul 28 1992 12:02 | 15 |
| What I meant is: what properties of the application require this sort of latency?
Compute-intensive simulation is an obvious application for a high BANDWIDTH
network, but it does not impose a low latency requirement. So I'm still
looking for an explanation. It may well be that the requester is confused
and we simply need to straighten out the requirement. It may also be that
the requirement is valid, but it's a lot easier to answer a requirement if
there is a clear definition of the background that justifies it, and there
hasn't been one.
Yes, Gigaswitch seems like the only interconnect technology that would meet
the numbers quoted. But keep in mind you also have to get the data through
the adapter, across the bus, and through the software (that's probably in
increasing order of slowness...).
paul
|
655.7 | Wow ! What are they willing to spend ? | MSBCS::KALKUNTE | Ram Kalkunte 293-5139 | Tue Jul 28 1992 13:27 | 32 |
| Well, for some of the reasons outlined earlier, FDDI (asynchronous)
was never the right choice for such applications (even though I am still
having a hard time figuring out what exactly this application is).
The ideal protocol for such communication would do its own flow control
and would be engineered to work with a given network. In any case, the
latency goal of 10 usec seems unreasonable with existing technology;
DEMFA, the fastest FDDI adapter to date, takes ~6 usec (best case) to
deliver the smallest FDDI packet from the fiber to the memory. An
average case will add queuing delays in the adapter and the memory,
and will also depend on packet size. I do not know what your average
packet size would be (?) or what your average system will be doing (?),
so I cannot comment on what your end-to-end transmission latency will
be with FDDI. This latency obviously does not include even a single
CPU instruction to process the packet.
Also, IO-bound tasks behave differently than compute-bound tasks, and
the bottom line is that the CPI you get for IO-bound programs is typically
much worse than for CPU-bound programs. I mention this so that people will be
careful when deciding how many instructions there should be in
the run-time loop for this application.
Since the kind of beast that you are looking for hasn't evolved yet
(methinks), don't waste time looking for it. If there are considerable
bucks on the line to make this happen, it would be a good idea to
write your own application. But this has to be with a revised
expectation of latency. If you need an estimate of what is achievable
(much better than IBM's millisecond range), I will need to understand
your application. Either you can post the details here or we can discuss
offline.
Ram
|
655.8 | Setup=wasted instructions | RDVAX::MCCABE | | Mon Aug 24 1992 12:32 | 35 |
| Maybe I can offer some help with the low latency requirement.
Distributed compiler technology provides automatic parallelism for
array-based operations. The result is that a data movement to another
processor can use the CPU cycles of many other processors. However,
the cost to initiate a send/receive pair equates to instructions that
could be used locally to process the data.
A 50 microsecond latency on a MIPS workstation is on the order of 1200
instructions. If the compiler does not have a good idea of how
long the remote processing step is going to take, it becomes quite
possible to spend more on the communication than the local processing
would take.
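A rough sketch of that break-even arithmetic, with an assumed MIPS rating and
latency (the numbers are illustrative, not measured):

/* Rough break-even estimate: distributing a chunk of work only pays off
 * if the remote work saved exceeds the instructions burned setting up
 * the send/receive pair.  All numbers here are assumptions. */
#include <stdio.h>

int main(void)
{
    double mips          = 25.0;   /* assumed workstation speed (MIPS) */
    double latency_us    = 50.0;   /* one send/receive pair (us)       */
    double overhead_insn = mips * latency_us;   /* ~1250 instructions  */

    /* Grain sizes: useful instructions shipped per message. */
    double grains[] = { 500.0, 2000.0, 50000.0 };
    int i;

    for (i = 0; i < 3; i++) {
        double efficiency = grains[i] / (grains[i] + overhead_insn);
        printf("grain %7.0f insns -> efficiency %4.0f%%\n",
               grains[i], 100.0 * efficiency);
    }
    return 0;
}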
As the numbers move up in magnitude, the cost of the remote processing
becomes relatively expensive due to the latency. Hence less
distribution is more efficient.
Granted, there are many coarse-grained applications that can still
benefit even when the latency is accounted for, but the total set
of applications is reduced.
Matrix reductions, distributed AXPYs, even SUM operations can be
done very quickly in parallel when communication is cheap. When it
is not, the addition of processors to a given problem can result in
longer, not shorter, execution times.
GIGAswitch does indeed look like a good mechanism for this kind of
distribution.
-Kevin McCabe
Engineering Manager, MPSG
P.S. We may indeed be quite interested in what you are doing ...
|
655.9 | Thanks and more details | LEMAN::MBROWN | | Tue Sep 29 1992 06:45 | 46 |
| I apologize for not getting back to this sooner. We have been swamped with
Alpha activity, several big conferences, and MPP work.
| I will get in touch with Kevin and Ram independently, but let me say that
Torbjorn and Kevin are 100% on target. We are planning on using GIGAswitch
as the interconnect, and 10 us is still an interesting target number.
Actually, I would go farther than Kevin and say that setup time equates to
wasted instructions on MANY systems. And it isn't just setup time; it is
the time required for copying data from one buffer into another, into another,
and finally into user buffers.
The applications are not constant. Some will have large transfers, some
will have small transfers, most will have a mix of transfers. However, from
my experience in other parallel processing environments, it is
synchronization latency (small packets) that is the most critical issue.
There will likely be two or three modes of operation, and this might answer
Ram's request for "application information" more directly.
The first mode is 10 Alpha workstations acting as a batch compute engine.
Uninteresting for special communication protocols.
The second mode is using a "data flow" programming model like PVM (Parallel
Virtual Machine) developed by Jack Dongarra and promoted by IBM (and hopefully
Digital) as a way of using workstations to solve medium-to-fine grained
parallel problems. Among other things, PVM provides a programming library
that hides details of the location of program modules and the communication
between them.
Dongarra's graduate students developed a special program library for efficient
Ethernet communication; the same is needed for an FDDI GIGAswitch environment.
IBM has done this for their version of PVM (called PVM/e) over Fibre Channel
connections.
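For reference, the kind of small synchronization exchange being discussed
looks roughly like this with PVM 3 style library calls (the message tag and
spawned task name are made up for illustration); every call in the sequence
contributes to the per-message software overhead:

/* Minimal PVM-style ping between two tasks; each call below adds to the
 * per-message software overhead being discussed.  Task name and tag
 * values are illustrative only. */
#include <stdio.h>
#include "pvm3.h"

#define SYNC_TAG 42                        /* made-up message tag */

int main(void)
{
    int mytid  = pvm_mytid();              /* enroll in PVM        */
    int parent = pvm_parent();             /* tid of spawning task */
    int token  = 1;

    if (parent == PvmNoParent) {
        /* Master side: spawn one worker and wait for its reply. */
        int worker;
        pvm_spawn("sync_worker", NULL, PvmTaskDefault, "", 1, &worker);
        pvm_initsend(PvmDataDefault);      /* set up pack buffer    */
        pvm_pkint(&token, 1, 1);
        pvm_send(worker, SYNC_TAG);        /* small "signal" packet */
        pvm_recv(worker, SYNC_TAG);        /* block for the echo    */
        pvm_upkint(&token, 1, 1);
        printf("round trip complete (tid %d)\n", mytid);
    } else {
        /* Worker side: echo the token back to the parent. */
        pvm_recv(parent, SYNC_TAG);
        pvm_upkint(&token, 1, 1);
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&token, 1, 1);
        pvm_send(parent, SYNC_TAG);
    }
    pvm_exit();
    return 0;
}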
The third mode of operation is where High Performance Fortran applications
are automatically distributed across multiple "workstations", and they are
linked together via a high speed network. FDDI is probably too slow, but
it is the best we have right now.
The shortest term need is for the PVM style support, but the HPF style support
will be very close behind. I expect that Kevin is already working on it.
Thanks for the help. More later when it becomes available.
Michael
|
655.10 | | KONING::KONING | Paul Koning, A-13683 | Tue Sep 29 1992 12:09 | 7 |
| I don't see anything in that list that suggests severe latency requirements,
certainly nothing anywhere near as tight as 10 microseconds. So I'm still
wondering how you came to the conclusion that such performance was needed.
(Never mind whether it's achievable with any hardware available from anyone
today.)
paul
|
655.11 | Missouri Requirement <show me> | LEMAN::MBROWN | | Tue Sep 29 1992 12:56 | 20 |
| Paul,
| You are right that there isn't a requirement that 100% of all latency be
under 10 microseconds. The original number I used in note .3 was a 30
microsecond delay from the time the application on system 1 begins
transmission of a small packet (say 100 bytes of useful data) until the
application on system 2 has the data in its buffer. There should be a
reasonable confidence level that the transmission will complete in this
amount of time.
Until I see otherwise, I will assume that this cannot be done using standard
UDP packets, or transparent or non-transparent DECnet.
Paul, if you or anyone else can show how long this takes using standard
protocols, I would love to see the data and be proved wrong. This would be
using GIGAswitch, so some of the default assumptions about token availability
are not valid. Tests on 2 node rings would be of high value.
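For anyone who wants to generate such data, a crude UDP round-trip timer along
these lines would do it (BSD sockets; the port and peer address are
placeholders, and the remote side is assumed to echo the datagram back -
halve the round trip for a one-way estimate):

/* Crude one-way latency estimate via UDP echo: send a small datagram,
 * wait for the echo, halve the round trip.  Port and peer address are
 * placeholders; the remote side must echo the packet back. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer;
    char buf[100];                      /* ~100 bytes of "useful data" */
    struct timeval t0, t1;
    double rtt_us;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(7001);                 /* made-up port    */
    peer.sin_addr.s_addr = inet_addr("16.1.0.2");  /* made-up address */

    gettimeofday(&t0, NULL);
    sendto(s, buf, sizeof(buf), 0, (struct sockaddr *)&peer, sizeof(peer));
    recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);  /* wait for echo   */
    gettimeofday(&t1, NULL);

    rtt_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("round trip %.1f us, one-way estimate %.1f us\n",
           rtt_us, rtt_us / 2.0);
    return 0;
}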
Regards,
Michael
|
655.12 | | KONING::KONING | Paul Koning, A-13683 | Tue Sep 29 1992 15:33 | 46 |
| I don't know how long this takes with standard protocols. Actually, that's
a fairly meaningless question; the more meaningful question is how long it
takes on a given implementation. (The particular implementation properties
are what determine the answer, not really any common properties of a particular
protocol.)
Something is backwards here. Requirements are supposed to be derived from
the application's needs. If you can determine what the application needs
(and I'm NOT referring to a number such as "30 microseconds" unless it comes
with some explanation of how it was derived from parameters observable by
users of the system) then you can determine whether a particular implementation
of some particular protocol will do the job. Tests of implementations will
validate performance claims for them and will give you confidence that they
will meet the requirements. But I'm getting the impression that you're looking
for performance data as a way to determine what the performance requirements
should be, and that's not the way to do it.
Looking back at .9:
mode 1 (batch compute engines) -- sounds like bulk data transfer (similar to
file transfer). Requires high throughput, but does not impose any significant
latency requirement.
mode 2 (fine grained parallelism) -- how fine is "fine"? I know this sort of
stuff has been done in academic R&D. To use it in commercial applications
requires picking grain sizes that aren't so small that most of the time
spent is overhead. As far as I know, remote procedure call or similar
approaches for doing this sort of thing currently have overheads measured
in milliseconds, not microseconds. Even if the actual network overhead
were zero, there's the application layer overhead (argument marshalling)
which can be quite substantial. So if "fine grained" refers to operations
that take a second or so, using thousands but not millions of bytes per second,
again you have no special requirements. If your grains complete in a few
milliseconds, you're not going to get much efficiency.
mode 3 (distribution of high performance fortran apps) -- that sounds similar
to mode 1, and again involves no significant latency requirements. How much
data has to be moved? You didn't mention, and that's the real question.
So to summarize: one of the three application modes you mentioned MAY
justify low latency requirements. You'll need to learn more about those
applications to find out the actual numbers. The other two applications
have no latency requirements (beyond the modest ones needed for good
throughput, which any reasonable implementation already meets).
paul
|
655.13 | Another Low-latency Application | JULIET::HATTRUP_JA | Jim Hattrup, Santa Clara, CA | Thu Feb 24 1994 13:02 | 14 |
|
I am looking for a 'reflective memory' type solution for a real-time
application. I am wondering if a low-latency FDDI solution (perhaps
using the Gigaswitch) would work.
A configuration would be 3 to 10 systems that need to update 50 Kbytes
of data between themselves (all of them) 30 times/sec. This is
1.5 Mbytes/sec - and delays in updates would cause problems. They have
33 milliseconds for computation and I/O (30 frames/sec), and can't miss
this window.
Is FDDI a workable solution? (SYSTRAN SCRAMnet is an alternative, but
we don't have VME-based mmap support on the VAX 7000. The likely config is
a VAX 7000 M620 and 2 to 4 SGI systems.)
|
655.14 | | KONING::KONING | Paul Koning, B-16504 | Thu Feb 24 1994 16:04 | 9 |
| Doesn't sound like a big deal. The throughput you need is a small fraction
of that available on a single FDDI ring, so you don't even need Gigaswitch;
just hang the nodes on a private ring. Given the low load, there is absolutely
no channel access delay problem, and the adapters won't introduce any
significant delay either.
Judging by the numbers, you could ALMOST do this on Ethernet... :-)
paul
|
655.15 | | UFP::LARUE | Jeff LaRue: U.S. Network Resource Center | Fri Feb 25 1994 14:32 | 15 |
| re: reflective memory
This is exactly the kind of thing that we are in the process of
creating for an air space management system here at Westinghouse.
I have architected a solution that relies on multicast addressing
in order to allow every node to be seen by every other node, etc.
We were required to use Ada for the implementation of this capability.
To date, we have found that the bandwidth of a private FDDI ring
is sufficient to handle multiple tens of Alphas with an aggregate
transmission of 10+ Mb/sec. Additionally, the latency is more than low
enough to meet the needs of the program.
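For illustration only (the project itself is in Ada, and may well use
MAC-level rather than IP-level multicast), the send side of such a multicast
update could be sketched in C roughly as follows, with a made-up group
address and port:

/* Minimal multicast sender sketch: every node transmits its state to a
 * common group address so all peers see the update.  Group address and
 * port are placeholders, not the ones used by the project described. */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in group;
    unsigned char ttl = 1;            /* keep traffic on the local net   */
    char state[1400];                 /* one fragment of the shared data */

    memset(state, 0, sizeof(state));
    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_port   = htons(9001);                   /* made-up port  */
    group.sin_addr.s_addr = inet_addr("239.1.1.1");   /* made-up group */

    setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, (char *)&ttl, sizeof(ttl));
    sendto(s, state, sizeof(state), 0,
           (struct sockaddr *)&group, sizeof(group));
    return 0;
}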
-Jeff
|