Conference gyro::internet_tools

Title: Internet Tools
Notice: Report ALL NETSCAPE Problems directly to [email protected].rnet? Read note 448.L for beginner information.
Moderator: teco.mro.dec.com::tecotoo.mro.dec.com::mayer
Created: Fri Jun 25 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4714
Total number of notes: 40609

4577.0. "AV Search Service - Adding a URL" by MKOTS3::HAHN (SBU Americas Technical Support Group) Wed Mar 26 1997 12:49

We received a call from an individual who set up a web site for a non-profit 
organization (Kabbalah Centres of Florida). The site is hosted by an ISP 
(Icanect). He wanted to add the new site so that the AltaVista Public Search 
Service would be able to find and index the site. So he went to 
http://www.altavista.digital.com/ and clicked on the "Add URL" to submit his 
request. The URL he added is:

http://www.icanect.net/kabbalah/

Well, it's been a few days now and he is complaining that the AV Search Service
won't generate a hit on his site. 

I did some testing myself. I can access both www.icanect.net and 
www.icanect.net/kabbalah directly from Netscape (Open Location...).
Now, if I go to AV Search and enter "icanect", I do receive a number of
hits to the Icanect home page. However, if I click on any of them, I receive
error 404:

        The page you have tried to access no longer exists, has an incorrect
        URL, or never did exist. Please check the URL carefully and retype it
        if necessary. 

        We have recently renovated our entire website. Much of our former
        website has been integrated into this renovation. Feel free to click on
        the links to the left to navigate you through our new website. 

        If you feel this is an error we may have overlooked, please notify us
        by clicking our night owl. 

        Sincerely,
        The Icanect Team 
 
Now, I'm thinking that he can't get a hit for the /kabbalah site because the
spider hasn't indexed the "renovated" main link into that site - the ISP's 
www.icanect.net. Is this correct? What needs to be done at this point? Does 
the ISP need to perform an "Add URL" for www.icanect.net?


T.R  Title  User  Personal Name  Date  Lines
4577.1  "too many URLs..."  LGP30::FLEISCHER  without vision the people perish (DTN 381-0426 ZKO1-1)  Wed Mar 26 1997 12:58  70
        This note touched upon a pet peeve of mine regarding the
        AltaVista public search site.  I recently gave them the
        following feedback about a problem in adding a URL:

Date: Fri, 21 Mar 1997 17:12:05 -0800
From: Bob Fleischer <[email protected]>
Organization: Digital Equipment Corp NSIS
To: [email protected]
Subject: "too many URLs..."

My son is developing a web site, and he wanted to get it listed on a
variety of search engines and web directories.  The URL is
http://www.tiac.net/users/rjf/fizbin.html .

He was able to get it listed by Lycos, Excite, InfoSeek, Yahoo, and
several other services suggested by SubmitIt! (SubmitIt was suggested by
AltaVista, in fact).

However, he was not able to get AltaVista to accept his URL -- the
response was "Too many URLs at that site have been submitted---sorry."

Such a rejection wouldn't have hurt half as bad if it came from a site
which had no claim to being the biggest and fastest site built with
technology having the highest capacity -- but it seems absurd that
AltaVista of all sites should be the only one to say "I'm sorry, I've
reached my capacity!"

I'm sure that there are a lot of submissions from a site such as 
www.tiac.net -- they are, after all, a large ISP hosting thousands of
users and businesses.  No other search engine had a problem with that,
however.

-- 
Bob Fleischer
Digital Equipment Corporation, Network and Systems Integration Services
110 Spit Brook Road (ZKO1-1/J33), Nashua, NH 03062       (603) 881-0426
[email protected]             http://www.tiac.net/users/rjf/

        and their response:

Date: Mon, 24 Mar 1997 09:09:07 -0800 (PST)
From: Alta Vista Support <[email protected]>
To: Bob Fleischer <[email protected]>
Subject: Re: "too many URLs..."

From [email protected] Mar 18 09:01:47 1997
Date: Tue, 18 Mar 1997 09:00:51 -0800
From: Ty Tressitte <[email protected]>
To: [email protected]

In order to help ensure our users retrieve a fair and unbiased selection 
of relevant pages, we have had to limit submission of pages from a 
provider, site or individual user that fall into the following categories:

1. Submissions that are designed to limit our ability to rank pages 
accurately.

2. Duplicate and near duplicate pages that both utilize excessive storage 
and reduce the number of relevant search results.

3. Excessive amounts of pages submitted from a given provider, site or 
individual in a day.

You can try submitting your page in the late afternoon when the
submission limits are reset.


Regards,

AltaVista Support
4577.2  BUSY::SLAB  Crazy Cooter comin' atcha!!  Wed Mar 26 1997 13:01  7
    
    	Yeah, I just tried submitting
    
    	http://www.icanect.net
    
    	and got the same "too many URL's" error.
    
4577.3  MKOTS3::HAHN  SBU Americas Technical Support Group  Wed Mar 26 1997 13:10  7
Thanks for the quick replies.

So I guess I need to send an e-mail to AltaVista Support and ask them to reset
the submission limits for www.icanect.net?
    

4577.4  HYDRA::VANORDEN  Wed Mar 26 1997 17:35  32
    
    >So I guess I need to send an e-mail to AltaVista Support and ask them
    >to reset the submission limits for www.icanect.net?
    
    
    Yes.
    
    What is confusing is that your customer did not say they received the 
    "too many URLs at this site" error when they originally submitted their 
    URL.  It could be that the URL was accepted but for some reason could 
    not be fetched (network busy, server down due to the renovation of
    the site, etc.).  Under those circumstances the URL is not added.
    
    The AltaVista Search Site has had problems with spamming: submitting a 
    large number of pages to the index in the hope of increasing a site's 
    appearance and ranking.  I suspect that AV views this as an ethical
    issue rather than a performance issue.  As they put it, 
    "Left unchecked, this behavior would make web indexes worthless".  It
    seems they attempt to solve this problem by putting a quota on submitted
    URLs and expecting Scooter (the AV crawler) to find the other pages.
    Unfortunately this method does not take into consideration the many
    unique web pages which reside in directories on the same HTTP server
    (such as www.icanect.net and members.aol.com).  It is also possible
    that so many attempts have been made to add the
    http://www.icanect.net/kabbalah/ site that it now looks like an attempt
    to spam, and AltaVista is ignoring the site.
    
    You need to send mail to AltaVista Support alerting them to the problem
    so they can bypass the quota on your customer's behalf.
    
    Donna 
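
    A minimal sketch of how a per-host submission quota of the kind described
    above might behave. This is an illustration only, not AltaVista's actual
    implementation: the class name SubmissionQuota, the limit of 3 per day,
    and the /alice/ and /bob/ paths are all assumptions made for the example.
    It shows why one customer on a shared ISP host can be turned away because
    of what other customers on that host submitted earlier in the day.

```python
from collections import Counter
from urllib.parse import urlsplit


class SubmissionQuota:
    """Hypothetical per-host daily cap on "Add URL" submissions."""

    def __init__(self, max_per_host_per_day=3):  # assumed limit, not a real figure
        self.max_per_host_per_day = max_per_host_per_day
        self.counts = Counter()

    def submit(self, url):
        """Return True if the URL is accepted, False if its host is over quota."""
        host = (urlsplit(url).hostname or "").lower()
        if self.counts[host] >= self.max_per_host_per_day:
            return False  # the "too many URLs" case
        self.counts[host] += 1
        return True

    def reset(self):
        """Clear all counters, as a daily reset of the limits would."""
        self.counts.clear()


quota = SubmissionQuota(max_per_host_per_day=3)
urls = [
    "http://www.icanect.net/",
    "http://www.icanect.net/alice/",     # hypothetical customer page
    "http://www.icanect.net/bob/",       # hypothetical customer page
    "http://www.icanect.net/kabbalah/",  # rejected: host already at its limit
]
results = [quota.submit(u) for u in urls]
```

    Because the counter is keyed on the host name alone, every directory
    under www.icanect.net shares one allowance. Keying on host plus first
    path segment would treat each customer separately, at the cost of making
    the quota easier to evade.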
    
4577.5  VAXCPU::michaud  Jeff Michaud - ObjectBroker  Wed Mar 26 1997 20:45  8
>     	Yeah, I just tried submitting
>     	http://www.icanect.net
>     	and got the same "too many URL's" error.

	What are you talking about, Shawn?  1pm (when you posted your
	note) is not late afternoon, if that's what you are referring to.
	And it's especially not late afternoon if it's Palo Alto time
	(add 3 hours for ET).
4577.6  BUSY::SLAB  Don't drink the (toilet) water  Thu Mar 27 1997 01:25  6
    
    	If I remember right, I posted that reply before reading the
    	previous one with the time recommendation.
    
    	I guess I should have written it quicker.
    
4577.7  Someday when AV Forum merges with Notes, maybe we'll hear from AV employees...  CIRCUS::GOETZE  Tibetan karma not Made in China  Thu Mar 27 1997 16:37  23
    The "problem" with AV not indexing all of a site has become a public
    issue:
    
[ Forwarded message ]
    
>AltaVista also is not as deep a search engine as you might think.  The
>email below is from zdnet's "talkback" area.  The URL is
>
>http://www5.zdnet.com/anchordesk/talkback/talkback_11638.html
>
>The author reveals that AltaVista doesn't index all pages on a site --
>indeed, geocities.com, with 20,000+ pages, only has about 300 listed. 
>You can check this by doing a search for host:geocities.com on
>AltaVista.

    
    If this is true, I'm just as shocked as the writer above. I thought 
    that the original conversation by some DIGITAL researchers at 
    Left at Albuquerque about the idea for AltaVista was essentially 
    "index the entire Web". If the scope of the project has at some point 
    been scaled back, it seems to have been done without any public notice.
    
       erik
4577.8  BUSY::SLAB  Form feed = <ctrl>v <ctrl>l  Thu Mar 27 1997 17:17  7
    
    	I'm not sure how often the index is updated, but whatever the time
    	span is, it's too long.
    
    	All too often I get "not found" errors because the page has
    	disappeared since the index was last created.
    
4577.9  QUARK::LIONEL  Free advice is worth every cent  Sun Mar 30 1997 21:34  20
    AltaVista has indexed only a handful of the ourworld.compuserve.com
    sites - my page there keeps getting dropped from AV's list.  The
    problem, as Louis Monier once explained to me, is that AV tries to be a
    "good citizen" and stops indexing a site if it is pulling down what it
    thinks are too many pages.
    
    One unfortunate side effect of people having difficulty getting URLs
    indexed is that some of them decide that the way to do it is to "spam"
    the Digital e-mail list with demands to get the site indexed.  One jerk
    set up a batch job to remail the same complaint to the entire list
    every day.
    
    (Another jerk decided that he DIDN'T want his site indexed - he refused
    all suggestions for how to do this on his own and blasted the e-mail
    list dozens of times a day with incoherent rantings.)
    
    Unfortunately, it seems that the AV people don't have adequate staff to
    respond to inquiries.
    
    					Steve
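
    The "good citizen" behaviour described at the top of this reply can be
    sketched as a cap on how many pages the crawler takes from any one host.
    This is a guess at the general idea, not a description of what AV really
    does: the function name select_pages, the cap of 300, and the generated
    member-page URLs are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_pages(urls, max_pages_per_host):
    """Keep at most max_pages_per_host URLs per host, in the order given."""
    fetched = []
    per_host = Counter()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if per_host[host] >= max_pages_per_host:
            continue  # politeness cap reached: skip the rest of this host
        per_host[host] += 1
        fetched.append(url)
    return fetched


# Hypothetical member pages on one big shared host, plus one small site.
member_pages = [
    f"http://ourworld.compuserve.com/homepages/user{i}/" for i in range(1000)
]
small_site = ["http://small.example/index.html"]

kept = select_pages(member_pages + small_site, max_pages_per_host=300)
```

    With a cap like this, a host carrying a thousand unrelated home pages
    gets the same allowance as a single small site, which would be consistent
    with pages on large shared hosts being dropped.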
4577.10  CIRCUS::GOETZE  Tibetan karma not Made in China  Tue Apr 01 1997 15:46  12
    
    re: not indexing the entire Web:
    
    I hear it's simply a budgetary problem--if AV were to index
    all known pages *and* keep the same responsiveness level it has today,
    it would take three times as many turboLasers as they use today 
    (12, each with max CPU boards?). 
    
    That's too bad. I might start using a search engine which does attempt
    to index all pages.
    
       erik
4577.11  well, they're Digital  LGP30::FLEISCHER  without vision the people perish (DTN 381-0426 ZKO1-1)  Wed Apr 02 1997 07:26  12
re Note 4577.10 by CIRCUS::GOETZE:

>     That's too bad. I might start using a search engine which does attempt
>     to index all pages.
  
        Yes -- indexing everything is AltaVista's claim to fame.

        (Yes, I understand that it was never possible to really index
        "everything", but one would expect that static, un-hidden
        pages would certainly be indexed.)

        Bob
4577.12  Spiders can only crawl over links  STAR::COPE  Wed Apr 02 1997 10:40  16
    Also, AV only indexes pages its crawler can find (via links from
    other pages), correct? If I were an ISP, and allowed my users to have
    homepages at
    
    http://myisp.com/~joeuser/index.html
    
    Joe's page would never make it into AltaVista unless and until there
    was a link to get there from some page in AV's space (which, I
    expect, starts at large sites like Yahoo and works its way down?)
    Isolated groups of pages with no outside references just aren't
    going to make it.
    
    (I'm just guessing here; this may not be all that relevant to which
    pages get passed over... but it seems like another thing to consider.
    Spiders can only crawl; they can't jump.)
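
    The point about crawling versus jumping can be sketched as a
    breadth-first walk over a link graph. The graph below is a toy example
    with made-up links (only the myisp.com/~joeuser URL comes from the note
    above), and the function name crawl is hypothetical; a real spider would
    fetch pages and extract links rather than read them from a dictionary.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Return every page reachable from the seed pages by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Toy link graph: page -> pages it links to (all links are assumptions).
link_graph = {
    "http://www.yahoo.com/": ["http://myisp.com/"],
    "http://myisp.com/": ["http://myisp.com/~alice/index.html"],
    # Joe links out to the ISP home page, but nothing links in to Joe.
    "http://myisp.com/~joeuser/index.html": ["http://myisp.com/"],
}

indexed = crawl(link_graph, ["http://www.yahoo.com/"])
joe_found = "http://myisp.com/~joeuser/index.html" in indexed

# One inbound link from an already reachable page changes the outcome.
link_graph["http://myisp.com/~alice/index.html"] = [
    "http://myisp.com/~joeuser/index.html"
]
indexed_after = crawl(link_graph, ["http://www.yahoo.com/"])
joe_found_after = "http://myisp.com/~joeuser/index.html" in indexed_after
```

    In this sketch Joe's outbound links do nothing for him; only an inbound
    link from a page the crawler already reaches (or a direct submission)
    brings his page into the reachable set.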
    
4577.13  AltaVista  CONSLT::OWEN  Stop Global Whining  Wed Apr 02 1997 10:44  21
    I think AltaVista might save some space if they put a size limit on
    the pages that it indexes.  More and more I'm seeing log files and other 
    data dumps get indexed: pages that are many, many megs in size.  And
    since these contain SO MUCH information, they often pop up in search
    results even though they have nothing to do with what you're
    searching for.  
    
    I don't think it is unreasonable to ask that if a page is to be
    indexed, it be kept to a reasonable size.  I don't know what
    that number is, but it's certainly less than 1 meg.  Maybe 200K. 
    Lower? Higher?
    
    How about dumping pages that haven't been updated in over a year?
    
    Like others have said here, it's really a shame that AltaVista punishes
    people on large ISPs with lots of pages.  TIAC is a good example.  If
    some bozo on TIAC spammed the index, don't make everyone else pay for
    it.
    
    -Steve
    
4577.14  DECCXL::WIBECAN  That's the way it is, in Engineering!  Wed Apr 02 1997 10:52  8
>>    How about dumping pages that haven't been updated in over a year?

Bad idea.  There are a great many sources of information that do not change
over the years.  If someone were, for example, to supply the complete works of
Shakespeare over the web, there would be little reason to change the pages,
ever.

						Brian
4577.15  BUSY::SLAB  Dancin' on Coals  Wed Apr 02 1997 11:22  6
    
    	RE: .12
    
    	That doesn't explain something like Geocities, though, which has
    	links to all of its pages on the main page.
    
4577.16  4MB limit imposed?  WOTVAX::16.194.64.183::watson  OK, whats todays long term strategy?  Wed Apr 02 1997 12:32  6
re: .13

According to the book "The AltaVista Search Revolution" page 91, files over 
4MB are truncated.

-- Rob
4577.17  The complete works of Shakespeare...  STAR::PITCHER  Steve Pitcher/Pathworks for OpenVMS  Mon Apr 07 1997 13:57  11
    re: .14
    
    What "If"!
    
    >> If someone were, for example, to supply the complete works of
    >> Shakespeare over the web, 
    
    See:    http://the-tech.mit.edu/Shakespeare/works.html
    
    
    - 	stp
4577.18  Adding personal URL to AltaVista  VMSNET::RRICK  I'd rather be fishing!  Wed Apr 09 1997 21:14  10
    It is possible to add personal URLs of the form
    www.network.com/~myusername
    
    At the bottom of the AltaVista search page is an option,
    Add URL. Just place your personal web page there and you're all set.
    
    I did so and showed up in Alta Vista the next day.
    
    Randy
    
4577.19  QUARK::LIONEL  Free advice is worth every cent  Thu Apr 10 1997 12:46  3
But keep a watch out - it is likely to disappear after a month or two.

				Steve
4577.20  PCBUOA::BAYJ  Jim, Portables  Thu Apr 10 1997 16:05  12
    On the other hand, getting a page *out* of some of the other search
    engines is darn near impossible.  I have a page that was indexed in
    November.  A couple of months ago I placed noindex meta tags and a robot
    file there, but it is STILL there, still showing a date of November.
    
    Some pages may get updated frequently, but not all.  Either that, or
    once a page is cataloged, when they go back, they only check if it's
    there, and don't look for the meta tags or robot file during the
    refresh pass.
    
    jeb
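
    For reference, a sketch of the two checks a well-behaved indexer would
    be expected to repeat on each refresh pass: the robot exclusion file and
    the robots meta tag. It uses only Python's standard urllib.robotparser
    and html.parser modules. The example.com URLs, the sample robots.txt
    contents, and the names RobotsMetaParser and should_keep_in_index are
    assumptions for illustration; whether any given engine re-checks these
    on refresh is exactly the open question above.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Sample robot exclusion file (assumed contents).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class RobotsMetaParser(HTMLParser):
    """Record whether a page carries <meta name="robots" content="noindex">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = (attributes.get("content") or "").lower()
            if "noindex" in [d.strip() for d in content.split(",")]:
                self.noindex = True


def should_keep_in_index(url, html, robots):
    """Return False if robots.txt or a noindex meta tag excludes the page."""
    if not robots.can_fetch("*", url):
        return False
    parser = RobotsMetaParser()
    parser.feed(html)
    return not parser.noindex


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

blocked_by_file = should_keep_in_index(
    "http://example.com/private/page.html", "<html></html>", robots)
blocked_by_meta = should_keep_in_index(
    "http://example.com/page.html",
    '<html><head><meta name="robots" content="noindex"></head></html>',
    robots)
plain_page = should_keep_in_index(
    "http://example.com/page.html", "<html><body>hello</body></html>", robots)
```

    A refresh pass that only tests whether the URL still answers, and never
    runs a check like this, would keep showing a stale entry indefinitely.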
    
4577.21  the index is *very* sticky  FIEVEL::FILGATE  Bruce Filgate SHR3-2/W4 237-6452  Sun Apr 13 1997 19:28  9
 There are lots of dead links in the AltaVista index, almost as if
 once an entity is indexed, it is never again checked.  More likely
 the memory algorithm is just overly sticky.  Sorry to .-1,  but the
 dead links I checked had been dead much longer than a couple of
 months; one appeared to have been taken down 10 months before 
 AltaVista pointed me there.

 Bruce
4577.22  JAMIN::OSMAN  Eric Osman, dtn 226-7122  Tue Apr 29 1997 11:27  12
    
    I've never wanted a page of mine altavista'd.  But if I did, and
    I found the owner of some page that was already altavista'd and
    convinced that owner to link to my page from theirs, wouldn't
    mine automatically be altavista'd within a week or two?  (How often does
    altavista do its crawl?  How long does it take to crawl?)
    
    Would this method work better than the direct-email method (which seems
    to dead-end in the too-many-requests error)?
    
    /Eric