
Conference gyro::internet_tools

Title:Internet Tools
Notice:Report ALL NETSCAPE Problems directly to [email protected].rnet? Read note 448.L for beginner information.
Moderator:teco.mro.dec.com::tecotoo.mro.dec.com::mayer
Created:Fri Jun 25 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:4714
Total number of notes:40609

4475.0. "Need input on Alta Vista" by 33102::JAUNG () Fri Feb 14 1997 16:23

    We are considering implementing a customized application that
    combines the AltaVista search engine for text search over an intranet.
    There are about 2.5 million pages of text data (2.5 GB) stored in 
    Rdb Multimedia version (together with image data) running on a VAX 4000.
    Users will access it from PCs via Netscape/Explorer.  We'd appreciate any 
    input on the following from people who have experience with it.
    
    1. Do we need Alpha Unix or Alpha NT as middleware?
    
    2. What is the structural design when using AltaVista?  Is the
       following right?
    
    	Build Alta Vista Forum
    	Initiate AltaVista Mail server
    	Access Alta Vista Search Intranet Private eXtension
    	Assemble Alta Vista Directory
    
    3. Approximately how long might it take to implement?
    
    Thanks in advance.
    
    						NYOSS1::Jaung
T.R  Title  User  Personal Name  Date  Lines
4475.1  teco.mro.dec.com::tecotoo.mro.dec.com::mayer  Danny Mayer  Fri Feb 14 1997 17:10  5
	I think you may have a problem here.  AltaVista Search does not deal
  with databases.  Even assuming you can do this, you also need a web server
  running on the VAX with the database.

		Danny
4475.2  Web will run  33102::JAUNG  Fri Feb 14 1997 18:05  16
    ref .1
    
    Thanks for responding.
    
    Yes, we plan to run a Web server (NT or UNIX) on the middleware to 
    submit inquiries to VAX/Rdb.  One of those inquiries is to read
    a table full of text information (not indexed).  We "hope" the
    AltaVista search engine can help us to search the text contents.
    
    The problem, as indicated in the previous note, is that since
    AltaVista Search does not deal with databases, (we think) we may 
    need to build some kind of index files for specified tables 
    on the middleware for AltaVista to search through (if we can).  
    When users want more detailed information, we'll retrieve the data
    from Rdb based on the pointers.  The question is: will this work, or
    is there a better solution?
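    
    The index-plus-pointers idea above might look roughly like the sketch
    below.  This is only an illustration in Python and says nothing about
    AltaVista's own index format: the rows are made-up stand-ins for the
    Rdb text table, and build_index and search are hypothetical helper
    names.  The index maps each word to the row ids containing it, so the
    full text is fetched from Rdb only for the ids a search returns.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for rows read from the Rdb text table: (row_id, text).
ROWS = [
    (101, "Quarterly report on VAX 4000 maintenance"),
    (102, "Maintenance schedule for Alpha servers"),
    (103, "Image archive summary report"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows):
    """Map each word to the set of row ids (the 'pointers') containing it."""
    index = defaultdict(set)
    for row_id, text in rows:
        for word in tokenize(text):
            index[word].add(row_id)
    return index

def search(index, query):
    """Return sorted row ids whose text contains every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

INDEX = build_index(ROWS)
print(search(INDEX, "maintenance"))         # ids of rows mentioning the word
print(search(INDEX, "report maintenance"))  # ids of rows containing both words
```

    The returned ids would then be used as keys in an ordinary SQL query
    against Rdb to fetch the detailed record.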
4475.3  netrix.lkg.dec.com::thomas  The Code Warrior  Sat Feb 15 1997 12:05  14
Having done something similar, maybe I can give some advice.

First of all, you will need a way to access the RDB data via
the Web.  This will probably involve custom CGI work (I wrote
my own HTTP server).

To fill the AltaVista index, you can use scooter to traverse
your RDB data if your pages allow it (i.e., no forms, a real
hierarchical URL scheme).

If not, you may consider writing your own program to create
the index files directly (that's what I did).
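
As a rough illustration of what "no forms, a real hierarchical URL
scheme" could mean in practice, here is a minimal Python sketch.  The
/docs/<table>/<id> layout, the DATA dictionary, and the render function
are all assumptions made up for the example; a real version would run
behind CGI and read from the database instead.

```python
from html import escape

# Hypothetical stand-in for the database: table name -> {row id: text}.
DATA = {
    "reports": {1: "Quarterly report", 2: "Annual summary"},
    "manuals": {7: "VAX 4000 owner's guide"},
}

def page(body):
    """Wrap a body fragment in a minimal HTML page."""
    return "<html><body>" + body + "</body></html>"

def render(path):
    """Return an HTML page for a hierarchical path, or None if unknown.

    /docs/               lists the tables
    /docs/<table>/       lists the rows in one table
    /docs/<table>/<id>   shows the text of one row
    Every page is reachable through plain links; no forms or query strings.
    """
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "docs":
        return None
    if len(parts) == 1:
        links = [f'<a href="/docs/{t}/">{escape(t)}</a>' for t in sorted(DATA)]
        return page("<br>".join(links))
    table = DATA.get(parts[1])
    if table is None:
        return None
    if len(parts) == 2:
        links = [f'<a href="/docs/{parts[1]}/{i}">{i}</a>' for i in sorted(table)]
        return page("<br>".join(links))
    if len(parts) == 3 and parts[2].isdigit() and int(parts[2]) in table:
        return page("<p>" + escape(table[int(parts[2])]) + "</p>")
    return None

print(render("/docs/"))
print(render("/docs/reports/"))
print(render("/docs/reports/1"))
```

Because each record has its own stable URL and is linked from a listing
page, a crawler that starts at /docs/ can reach every row just by
following links.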


4475.4  LGP30::FLEISCHER  without vision the people perish (DTN 381-0426 ZKO1-1)  Sun Feb 16 1997 11:27  10
        Matt's advice in Note 4475.3 is right on -- I just consulted
        with a delivery team on a similar customer project.  If you
        can put a crawlable web interface on the database, it then is
        really easy to have it searched by AltaVista (and in many
        cases you would want a web interface to the data, anyway).

        Otherwise you can use the AltaVista Search SDK to index the
        data (again, as 4475.3 suggested).

        Bob
4475.5  Any document available?  33102::JAUNG  Sun Feb 16 1997 15:50  7
    ref .3 and .4
    
    Thanks for the advice.   Are there any documents available that explain
    how to put a crawlable web interface on Rdb (we do plan to use
    CGI to submit SQL calls to Rdb)?
    
    
4475.6  Consider MS Index Server ?  XSTACY::imladris.ilo.dec.com::grainne  Mon Feb 17 1997 05:19  24
If for some reason you decide that the AltaVista approach isn't
feasible, it might be worth looking at the Microsoft Index
Server for IIS v3.0 (formerly codenamed Tripoli.) The SDK
required to develop filters for custom data formats is
pretty well documented. The IFilter SDK also includes 
source code for a sample filter. The documentation, SDK,
and sample filter source code are available from: 

http://www.microsoft.com/iis/default.asp

(select 'Using IIS' from the left navigation pane, then 
'Developing for IIS', then 'IFilter'.)

A friend of mine is using this approach to index a large 
collection of AutoCad drawings, stored as .DXF files. I would
also expect that either MS themselves or third parties will
in the future provide ready-made filters for all of the major data 
types, including ODBC datasources.





4475.7  teco.mro.dec.com::tecotoo.mro.dec.com::mayer  Danny Mayer  Mon Feb 17 1997 10:05  7
> If for some reason you decide that the AltaVista approach isn't
> feasible, it might be worth looking at the Microsoft Index
> Server for IIS v3.0 (formerly codenamed Tripoli.) The SDK

	This doesn't run on a VAX.

		Danny
4475.8  XSTACY::imladris.ilo.dec.com::grainne  Mon Feb 17 1997 15:53  23
>>> If for some reason you decide that the AltaVista approach isn't
>>> feasible, it might be worth looking at the Microsoft Index
>>> Server for IIS v3.0 (formerly codenamed Tripoli.) The SDK

>>        This doesn't run on a VAX.

>>     Danny

In .2, the basenoter stated that the HTTP server could be on
NT.
Using the MS IFilter SDK, I don't think you require that the index 
server run on the same node as the database server. You could 
run both the index server and HTTP server on NT and use the ODBC 
driver for Rdb and associated underpinnings to access the
Rdb database.

I know of at least one project (outside of Digital) that's
considering this approach, using IIS v3.0 on NT and an Oracle
database on UNIX. They have the problem that their application
consists almost entirely of ASP pages, generated dynamically
from the contents of their database using the IIS v3.0 ADODB
Active Server Component and the ODBC drivers for Oracle. Therefore, 
it cannot be indexed by conventional web search engines. 
4475.9  LGP30::FLEISCHER  without vision the people perish (DTN 381-0426 ZKO1-1)  Tue Feb 18 1997 10:04  37
re Note 4475.8 by XSTACY::imladris.ilo.dec.com::grainne:

> They have the problem that their application
> consists almost entirely of ASP pages, generated dynamically
> from the contents of their database using the IIS v3.0 ADODB
> Active Server Component and the ODBC drivers for Oracle. Therefore, 
> it cannot be indexed by conventional web search engines. 
  
        Dynamic pages per se are not inherently non-indexable by
        crawlers.

        A web client really can't tell that a given page was the
        output of a program as opposed to a file.

        What it can tell is whether the page was (or would be)
        generated in response to form input, a query, or your basic
        HREF link to a URL.  It can also be told just to stay away
        from certain URLs by the use of the robots.txt file.
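
        To make the robots.txt point concrete, here is a small sketch
        using Python's standard urllib.robotparser module.  The
        robots.txt contents, the host name intranet.example, and the
        crawler name ExampleCrawler are assumptions invented for the
        example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers away from a query script
# and a private area while leaving the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/search
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleCrawler"):
    """Return True if the rules above permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://intranet.example/docs/reports/1"))
print(allowed("http://intranet.example/cgi-bin/search?q=vax"))
print(allowed("http://intranet.example/private/notes.html"))
```

        A well-behaved crawler performs a check like this before
        requesting each URL and skips anything the rules disallow.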

        Note that "your basic HREF link to a URL" traditionally
        returns a (static) file, but there is no guarantee of that.

        What makes a web site (or portion thereof) non-indexable are
        forms, queries, passwords, and required cookies.  (Obviously,
        with applets, there are many new opportunities for content
        that can't be crawled.)  If there is a way to get to all the
        content just by clicking on links a crawler should be able to
        find it.  (Most crawlers have logic to detect and cut
        recursive paths, also.)  It doesn't really matter how or when
        the page was created.
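
        The reachability argument above can be sketched as a toy
        crawler in Python.  The SITE dictionary is a made-up in-memory
        site used in place of real HTTP fetches, and skipping every URL
        that carries a query string is a simplifying assumption, not a
        description of how any particular crawler behaves.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory site: URL path -> HTML.  A real crawler would
# fetch these pages over HTTP instead.
SITE = {
    "/": '<a href="/a">A</a> <a href="/search?q=x">query</a>'
         '<form action="/find"><input name="q"></form>',
    "/a": '<a href="/b">B</a> <a href="/">home</a>',
    "/b": '<a href="/a">back to A</a>',
    "/find": "only reachable by submitting the form",
}

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags; forms are never submitted."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start):
    """Breadth-first crawl that follows plain links and visits each URL once."""
    visited = []
    seen = {start}          # the seen set is what cuts recursive paths
    queue = [start]
    while queue:
        url = queue.pop(0)
        page = SITE.get(url)
        if page is None:
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(page)
        for href in collector.links:
            target = urljoin(url, href)
            if urlparse(target).query:
                continue    # simplifying assumption: skip query URLs
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

print(crawl("/"))
```

        Here /a and /b link back to each other and to the home page, yet
        each is visited only once, while /find is never discovered
        because the only way to it is through the form.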

        (Obviously, for crawling to produce useful results, a given
        URL when used again should generally return the same page, or
        a later version thereof, and not some totally unrelated page
        or nothing.)

        Bob