Last month we began a project to create a REXX program that communicates with the AltaVista search engine, executing a search and parsing the results. So far we have written the code that properly encodes our search string. Now we need to look at how the search server will interpret the request we send and write the code to create the actual request to be sent to the server.
Since there is no public documentation on the inner workings of the AltaVista server, we'll have to do a little detective work with our browser to see how things tick, but it's not too hard to figure out what's going on with a little experimenting. Executing a search for "OS/2 Supersite" with a web browser returns a URL that looks like this:
    http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22OS%2F2+Supersite%22&search.x=59&search.y=2
The first part:
    http://www.altavista.digital.com
is obviously the address that we will want to connect to with our program.
The next part:
    /cgi-bin/query
tells the search engine that we would like to use the CGI program called "query" to execute a search. The next character, the question mark, signals the beginning of the input that we want to send to the query program.
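As a rough sketch of that breakdown, the following REXX fragment pulls the example URL apart into the host, the CGI program's path, and the input after the question mark. It then breaks the input at the ampersands, which is how the CGI program is described as reading it in the next paragraph. The variable names (Url, Host, Path, Input, PairName, PairValue) are ours, chosen for illustration only.

```rexx
/* Sketch: split the example URL into host, path and CGI input */
Url = "http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22OS%2F2+Supersite%22&search.x=59&search.y=2"

/* Literal patterns in a Parse template split the string at each match */
Parse Var Url "http://" Host "/" Path "?" Input
Path = "/" || Path              /* put back the slash used as a pattern */

Say "Host:" Host
Say "Path:" Path

/* Walk through the input one variable=value pair at a time */
Rest = Input
Do While Rest \== ""
   Parse Var Rest PairName "=" PairValue "&" Rest
   Say "  " PairName "is set to" PairValue
End
```

Each pass of the loop peels one pair off the front of Rest, so the loop ends when the last pair (the one with no trailing ampersand) has been consumed.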
To understand how to assemble the CGI program's input, we have to understand how the program will interpret it. The CGI program splits the input at the ampersands. If you look at the URL above, you will see variable=value pairs separated by ampersands. After splitting the input, the CGI program uses the values of these variables to perform its functions. So all we have to do is form a string that contains the necessary variable=value pairs (properly encoding the values as we discussed last time). The tricky part, of course, is working out what the variables mean. For most of them, it's pretty obvious. In the above URL, for example, we see a variable
    q=%22OS%2F2+Supersite%22&
which is obviously the string that we would like to have the server execute a search on. Other variables take a little more work. For example, one variable
    pg=q&
doesn't seem very obvious. But if you go to the AltaVista main page and put your mouse cursor over the "Advanced Search" link, you will see a search URL with
    pg=aq&
so obviously the variable pg tells the CGI program whether to do an advanced search or a simple one. We will use the simple search for now. Another variable
    what=web&
changes when you select whether you want to search the web or the Usenet archives, since changing that to Usenet on the AltaVista web page causes the variable to be set to
    what=news&
We will concentrate on searches in the database of web sites. Let's go ahead and write the code that will create the query string. First we set up some variables that will, for now, be constant, and then create the query that we will send to the web server. Knowing that this code will be part of a larger program, let's put code in subroutines to make it easier to organize things. The main program for now looks like this:
    /* Search engine query program */
    crlf=d2c(13)||d2c(10)   /* Carriage return - linefeed pair */
    Site = "www.altavista.digital.com"
    SiteCommand = "GET /cgi-bin/query?pg=q&what=web&q="
    SearchString = "OS/2 Supersite"
    UserAgent = "User-Agent: OS/2 REXX Query Program 1.0"   /* The name of our program */
    Call CreateQuery   /* Create the query */
    Say Query
    Exit
The CreateQuery routine is called in the main program and creates the actual query that will be sent to AltaVista:
    /* Create the query string to be sent to the web server */
    CreateQuery:
    /* Only allow the routine to see the necessary variables */
    Procedure Expose SiteCommand SearchString UserAgent Query crlf
    /* Create a list of the types of responses we can handle */
    Accept = "Accept: text/plain"||crlf||"Accept: text/html"||crlf
    SearchStringEncoded = Encode(SearchString)   /* Last month's code */
    Query = SiteCommand||SearchStringEncoded "HTTP/1.0"||crlf||Accept
    Query = Query || UserAgent || crlf || crlf
    Return
which uses last month's encoding routine to get the non-alphanumeric characters encoded correctly:
    /* Encoder routine for URLs */
    Encode: Procedure
    Parse Arg AString
    OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    NewString=""
    Do i = 1 to Length( AString)
       Test = SubStr( AString, i, 1)
       if Pos( Test, OkayChars) > 0 Then
          NewString = NewString || Test
       else Do
          If Test=" " Then
             NewString=NewString||"+"
          Else
             NewString = NewString || '%' || c2x( Test)
       end
    end
    Return NewString
There are several things in the code that need some explanation. First, notice how the variable SiteCommand has "GET" at the beginning. That is the HTTP command that we will issue to the web server to "GET" the web page that we want, in this case the output of a CGI program. The variable Accept contains the types of documents that we are prepared to accept and process, namely ASCII text or HTML text. Finally, the variable Query is formed by concatenating SiteCommand, the encoded search string, the string HTTP/1.0, the contents of Accept, and the contents of UserAgent. The HTTP/1.0 string tells the web server that we will use version 1.0 of HTTP when communicating with it.
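One detail worth checking is how the encoded search string compares with what the browser sent. The browser's URL at the start of the column carried q=%22OS%2F2+Supersite%22, that is, with the quote marks included, while SearchString in our main program holds the phrase without them. The sketch below runs the column's Encode routine on both forms; the variable names Plain and Quoted are ours, and whether the quotes change what the search engine returns is not something we have tested here.

```rexx
/* Sketch: encode the search phrase with and without quote marks */
Plain  = "OS/2 Supersite"
Quoted = '"OS/2 Supersite"'     /* the quote marks are part of the string */

Say Encode(Plain)               /* displays OS%2F2+Supersite */
Say Encode(Quoted)              /* displays %22OS%2F2+Supersite%22 */
Exit

/* Encoder routine for URLs, as given in the column */
Encode: Procedure
Parse Arg AString
OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NewString = ""
Do i = 1 to Length(AString)
   Test = SubStr(AString, i, 1)
   If Pos(Test, OkayChars) > 0 Then
      NewString = NewString || Test
   Else Do
      If Test = " " Then
         NewString = NewString || "+"
      Else
         NewString = NewString || '%' || c2x(Test)
   End
End
Return NewString
```

To reproduce the browser's q value exactly, then, the quote marks would have to be made part of SearchString before it is passed to CreateQuery.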
Notice the two CR/LF pairs at the end of the query string. We need two because that is how the web server knows where the header of our request ends. The server reads lines and interprets them as header lines until it sees a blank line. Whatever follows the blank line is the body of the request, which in our case is empty because everything the server needs to satisfy our request is in the request line and the headers. If we were using the HTTP "POST" command, we would instead encode our search input and place it in the body of the request.
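To make the header and body distinction concrete, here is a minimal sketch of what a POST version of the request might look like, followed by a loop that reads it the way the server is described as doing: line by line until the blank line, with the remainder treated as the body. This is an illustration under assumptions. We have not checked that AltaVista's query program accepts POST, and the names Body, Request, Rest and Line are ours. Content-Type and Content-Length are the standard HTTP headers that describe a form-style body.

```rexx
/* Sketch: a POST-style request, then split it at the blank line */
crlf = d2c(13)||d2c(10)
Body = "pg=q&what=web&q=OS%2F2+Supersite"    /* input, already encoded */

/* Assumption: the query program would accept its input via POST */
Request = "POST /cgi-bin/query HTTP/1.0" || crlf
Request = Request || "Content-Type: application/x-www-form-urlencoded" || crlf
Request = Request || "Content-Length:" Length(Body) || crlf
Request = Request || crlf                    /* blank line ends the header */
Request = Request || Body

/* Read lines as a server would, stopping at the first blank line */
Rest = Request
Do Forever
   Parse Var Rest Line (crlf) Rest
   If Line == "" Then Leave
   Say "Header:" Line
End
Say "Body:  " Rest
```

The first line displayed is the request line and the next two are header lines. What is left in Rest after the loop is the encoded input, which in our GET request travels in the URL instead.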
At this point we're ready to start talking to the web server. Next month we'll write the code that enables us to communicate with a remote web server over the Internet.
Download the source for this month's column: searchav.cmd (1K)
Dr. Dirk Terrell is an astronomer at the University of Florida specializing in interacting binary stars. His hobbies include cave diving, martial arts, painting and writing OS/2 software such as HTML Wizard.
Copyright © 1998 - Falcon Networking | ISSN 1203-5696