
Conference msbcs::hpc

Title:Parallel processing through Workstation Farms.
Notice:MSBCS::HPC (renamed from HPCGRP::WORKSTATION_FARMS)
Moderator:MSBCS::SYSTEM
Created:Tue Oct 27 1992
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:507
Total number of notes:1791

505.0. "PSE not start on 1 system" by CSC32::P_HILL () Tue Jun 03 1997 19:19

    
    
    I have a customer who sees the same problem on PSE120 and PSE130.
    He has a single file-based farm; no NIS or DNS farms are defined.
    PSE130 is installed on six AlphaServer systems running various
    versions of Digital UNIX (3.2e, 3.2g and 4.0b). On one system only, PSE
    will not start. There are no diagnostics in the system log, and
    "pseconfig verify" does not report any problems. The only information
    is from lspart, which reports "No information available" for one of
    the six nodes.
    
    The output that I obtain from running farmd manually is:                        
    
    {166}#/usr/sbin/farmd -farm /pse/nicfarm.db                                     
    farmd: Farm domain (nicfarm) sockets opened successfully                        
    farmd: Ring connection with nic1 established. (Right)

    which is what normally appears in syslog. Farmd usually
    starts and runs without error. "lspart -search"
    returns no information - we are using a file-based farm.
                                                                                    
    Everything else on this system is working properly. No known                    
    errors.                                                                         
                                                                                    
    I can make a connection via telnet to the correct port.
    
    $ lspart -partition allmembers                                                  
    Current farm:                                                                   
            nicfarm.db
    Farm Attributes:                                                                
            PSE_LOADSERVERS nic1 nic2                                               
            PSE_PREF_COMM shm mc fddi ethernet                                      
            PSE_FILESYSTEM nics1:/pse /pse nicu1:/u1 /nfs/u1 nicu2:/u2 /nfs/u2 nicu3:/u3 /nfs/u3 nicu4:/u4 /nfs/u4 nicu5:/u5 /nfs/u5 nicu6:/u6 /nfs/u6
            PSE_DEFAULT_PARTITION smp_zinc                                          
            PSE_SERVICEPORT 57039                                                   
    Partition data:                                                                 
    
    allmembers                                                                      
            Members(6):                                                             
            lead.corning.com               No information available                 
            nic1.corning.com               Jobslots = 1     load_avg = 0.09         
            nic2.corning.com               Jobslots = 1     load_avg = 0.00 
            nickel.corning.com             Jobslots = 6     load_avg = 4.10         
            tin.corning.com                Jobslots = 4     load_avg = 3.02         
            zinc.corning.com               Jobslots = 12    load_avg = 2.05         
                                                                                    
            PSE_PREF_COMM shm mc fddi ethernet 
    
    
    I am fairly new to PSE and not sure whether I am missing anything.
    Any ideas at all would be welcome.
    
    
    Paul  csc  

505.1. "copy of database file ?" by HPCGRP::BENSON () Wed Jun 04 1997 09:40, 6 lines

    Paul,
    
    Do you have a copy of the database file ?
    
    -Ed
    
505.2. "May be a problem in /etc/hosts" by NNTPD::"[email protected]" (Richard Warren) Wed Jun 04 1997 10:19, 20 lines

Paul,

 In your posting you mention that starting "farmd" by hand didn't report
 anything unusual.  While it might not seem strange that "lead.corning.com"
 reports a ring connection being made to "nic1", it tells me that the
 string "nic1" is being returned as the primary network name by
 gethostbyaddr() on that machine.   As a guess, the hosts file might be
 redone so that the fully qualified name is the first entry in the list,
 with the "aliases" following, e.g.
 149.42.1.2   corning.com  corning

 Doing the above would remove any doubt about matching database names to
 actual hostnames as returned by gethostbyaddr().  As it stands now, the
 comparison is done by strcasecmp() and would fail when attempting to match
 a fully qualified name against an alias to figure out farm membership.
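
 The matching problem described above can be sketched roughly as follows.
 This is only an illustration, not PSE's code: the helper names
 parse_hosts() and same_host() and the address 192.0.2.10 are made up, and
 the only assumptions taken from the posting are that the first name on a
 hosts-file line is the one returned as the primary name and that the
 comparison is a whole-string, case-insensitive match like strcasecmp().

```python
def parse_hosts(text):
    """Map each address to (canonical_name, aliases) from hosts-file text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # The first name after the address is the canonical (primary) name;
        # any further names on the line are aliases.
        table.setdefault(fields[0], (fields[1], fields[2:]))
    return table


def same_host(db_name, resolved_name):
    """Whole-string, case-insensitive match, like strcasecmp() == 0."""
    return db_name.lower() == resolved_name.lower()


# Hypothetical hosts-file contents; 192.0.2.10 is a documentation address.
SHORT_ONLY = "192.0.2.10  lead\n"
FQDN_FIRST = "192.0.2.10  lead.corning.com  lead\n"

for label, text in (("short only", SHORT_ONLY), ("FQDN first", FQDN_FIRST)):
    canonical, _aliases = parse_hosts(text)["192.0.2.10"]
    print(label, canonical, same_host("lead.corning.com", canonical))
# prints:
#   short only lead False
#   FQDN first lead.corning.com True
```

 With only the short name in the file, the resolved name never equals the
 fully qualified name, so a membership test of this kind fails even though
 both strings refer to the same machine.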

 Other than that, I'm wondering if I could get access to the machine?
 Richard

[Posted by WWW Notes gateway]
505.3. "PSE still not starting 1 sys" by CSC32::P_HILL () Thu Jun 05 1997 15:26, 92 lines
    
    After talking with the customer, it looks like this system is at a
    secure site, so we will not be able to get remote access to the machine.
    The customer did update the /etc/hosts file, though. Here's what he
    sent me:
     
     
    It turns out that the /etc/hosts file on the system on which PSE did
    not work had only short host names rather than fully-qualified names. I
    have fixed this, and restarted PSE. From syslog:
    
    Jun  4 11:34:19 lead farmd[649]: Farm domain (nicfarm) sockets opened successfully
    Jun  4 11:34:22 lead farmd[649]: Ring connection with nic1.corning.com established. (Right)
    Jun  4 11:34:22 lead farmd[649]: Ring connection with zinc.corning.com established. (Left)
    Jun  4 11:34:22 lead farmd[649]: Reinitializing
    Jun  4 11:34:22 lead farmd[649]: Warning! Using service entry (nicfarm 32/tcp), which differs from database SERVICE_PORT definition (57039)!
    Jun  4 11:34:22 lead farmd[649]: Farm domain (nicfarm) sockets opened successfully

    And also:
    
    # lspart -partition allmembers
    Current farm:
            nicfarm.db
    Farm Attributes:
            PSE_LOADSERVERS nic1 nic2
            PSE_PREF_COMM shm mc fddi ethernet
            PSE_FILESYSTEM nics1:/pse /pse nicu1:/u1 /nfs/u1 nicu2:/u2 /nfs/u2 nicu3:/u3 /nfs/u3 nicu4:/u4 /nfs/u4 nicu5:/u5 /nfs/u5 nicu6:/u6 /nfs/u6
            PSE_DEFAULT_PARTITION smp_zinc
            PSE_SERVICEPORT 57039
    Partition data:
    
            Members(6):
            lead.corning.com               No information available
            nic1.corning.com               Jobslots = 1     load_avg = 0.20
            nic2.corning.com               Jobslots = 1     load_avg = 0.19
            nickel.corning.com             Jobslots = 6     load_avg = 3.11
            tin.corning.com                Jobslots = 4     load_avg = 0.01
            zinc.corning.com               Jobslots = 12    load_avg = 3.51
    
            PSE_PREF_COMM shm mc fddi ethernet
    
    So, it still does not start up on the same system.
    
    This line from the log is puzzling:
    Jun  4 11:34:22 lead farmd[649]: Warning! Using service entry (nicfarm 32/tcp),

    The definition of the nicfarm port in /etc/services is the same as our
    NIS entry:

    nicfarm         57039/tcp

    Here's the farm definition:
    
    configuration_data      PSE_PARTITIONS allmembers bigmembers smp_nickel smp_zinc smp_tin smp_lead
    configuration_data      PSE_DEFAULT_PARTITION smp_zinc
    configuration_data      PSE_LOADSERVERS nic1 nic2
    configuration_data      PSE_UPDATE_PERIOD 60
    configuration_data      PSE_WHICH_LOADAVG 5
    configuration_data      PSE_SERVICEPORT 57039
    configuration_data      PSE_FILESYSTEM nics1:/pse /pse
    configuration_data      PSE_FILESYSTEM nicu1:/u1 /nfs/u1
    configuration_data      PSE_FILESYSTEM nicu2:/u2 /nfs/u2
    configuration_data      PSE_FILESYSTEM nicu3:/u3 /nfs/u3
    configuration_data      PSE_FILESYSTEM nicu4:/u4 /nfs/u4
    configuration_data      PSE_FILESYSTEM nicu5:/u5 /nfs/u5
    configuration_data      PSE_FILESYSTEM nicu6:/u6 /nfs/u6
    configuration_data      PSE_PREF_COMM shm mc fddi ethernet
    allmembers              PSE_MEMBERS zinc nickel tin lead nic1 nic2
    bigmembers              PSE_MEMBERS zinc nickel tin lead
    smp_nickel              PSE_MEMBERS nickel
    smp_zinc                PSE_MEMBERS zinc
    smp_tin                 PSE_MEMBERS tin
    smp_lead                PSE_MEMBERS lead
    
    
    Paul 
505.4. "Still no progress?" by NNTPD::"[email protected]" (Richard Warren) Thu Jun 05 1997 16:07, 14 lines
Re: .2
The problem, as you noticed, is that the farm daemon seems to get a
service port that doesn't match the database. When this happens I issue
a warning but take the /etc/services entry as the "real" value.
Port 32 will obviously not talk to port 57039!   From the looks of things,
the database is correct, as is /etc/services (though you should have
nicfarm    57039/udp  in addition to the tcp entry).
If the udp entry is missing, please add it to /etc/services.
In any case, in the absence of a reliable explanation, I'd simply stop
the existing farmd and try restarting it!!!
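
The /etc/services check suggested above could be scripted along these
lines. This is a rough sketch, not part of PSE: parse_services() and
check_service() are hypothetical helpers, and the sample text simply
repeats the single tcp line quoted in .3. It reports any protocol whose
entry is missing or whose port differs from the database's
PSE_SERVICEPORT value.

```python
def parse_services(text):
    """Return {(service_name, protocol): port} from services-file text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port_text, protocol = fields[1].split("/", 1)
        if not port_text.isdigit():
            continue
        # Keep the first entry seen for each (name, protocol) pair.
        entries.setdefault((fields[0], protocol.lower()), int(port_text))
    return entries


def check_service(text, name, expected_port, protocols=("tcp", "udp")):
    """List missing entries and ports that differ from the database value."""
    entries = parse_services(text)
    problems = []
    for protocol in protocols:
        port = entries.get((name, protocol))
        if port is None:
            problems.append(f"{name}/{protocol}: no entry")
        elif port != expected_port:
            problems.append(
                f"{name}/{protocol}: port {port}, database says {expected_port}"
            )
    return problems


# The only services line quoted in the thread (no udp entry shown).
SERVICES = "nicfarm         57039/tcp\n"
print(check_service(SERVICES, "nicfarm", 57039))
# prints ['nicfarm/udp: no entry']
```

This only inspects the text of the file. On a live system, Python's
socket.getservbyname("nicfarm", "tcp") asks the system's service lookup
instead, which, depending on configuration, may consult NIS as well as
the local file, so comparing the two results may help show where a value
such as 32 comes from.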

Richard

[Posted by WWW Notes gateway]