Sphider Website Search Engine
When I moved mushroomcompany.com to a new server, the search system I was using was no longer available, so I needed a new approach. I came across Sphider, which seemed to get decent reviews, so I installed it. This is what I learned.
According to the Sphider website:
“Sphider is a lightweight web spider and search engine written in PHP, using MySQL as its back end database. It is a great tool for adding search functionality to your web site or building your custom search engine. Sphider is small, easy to set up and modify, and is used in thousands of websites across the world.”
That sounded like what I needed, so I downloaded and began installing version 1.3.6 (June 4, 2013). Installation is straightforward:
- download the files
- upload to the server
- use phpMyAdmin to create a database and database user
- edit the settings file to provide the correct database information
- edit an authorization file to create a name and password for accessing the admin interface
- run a script to create the tables in the database
No problem, except that the script would not create a table called “query_log”. This table records each query made through Sphider and when it was made (as a timestamp). The script creates the table like this:
mysql_query("create table `".$mysql_table_prefix."query_log`(
The author of Sphider was kind enough to include a plain SQL script as an alternative way to create the tables. The pertinent part looks like this:
create table query_log (
ENGINE = MYISAM;
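The problem is in the creation of the “time” column. The definition “time timestamp(14)” wouldn’t work on my system, most likely because newer MySQL versions no longer accept a display width such as (14) on a TIMESTAMP column, so I changed the column definition to this: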
time TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
That allowed the table to be created, and everything has worked fine so far.
Indexing the Website
The next step is to use the admin interface to index the website and fill all the tables with links, keywords, etc. The admin interface is located at /sphider/admin/admin.php on a typical installation. After logging in, go to the “Index” tab, provide the URL for your home page, select “Full” to follow URLs as deep as they go on your site, and click “Start Indexing.”
At that time I had an osCommerce store set up as part of my website (it has since been converted to WordPress using WooCommerce). Sphider dutifully indexed it, and that created a problem. Each link that Sphider stored in the database had an osCsid attached to it, a long string of numbers and letters that keeps track of a user as they browse around the store. It looks to me as though, in indexing the store, Sphider opens every link, including the “Add to Cart” links, and so puts everything in the store into that session’s cart. When I searched for something that had a reference in the store and clicked the provided link, my cart was pre-filled with everything in the store. While that might seem like a good way to increase sales, it will mostly produce unhappy customers. Besides, we don’t want multiple customers browsing the store with the same osCsid.
Since the store has its own built-in search engine, I used Index > Advanced Options to say that the URL must not include “osCsid”, and that resolved the problem. I had a couple of other parts of the site that I didn’t want indexed, so I added them to the “must not include” list.
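To make the “must not include” idea concrete, here is a minimal Python sketch of that kind of filter. It is not Sphider’s own code (Sphider is written in PHP), and the function name `should_index` and the example URLs are made up for illustration.

```python
def should_index(url, must_not_include):
    """Return True if the URL contains none of the excluded substrings."""
    return not any(fragment in url for fragment in must_not_include)


# Hypothetical exclusion list and URLs, for illustration only.
excluded = ["osCsid"]
urls = [
    "http://www.example.com/catalog/product_info.php?products_id=12&osCsid=abc123",
    "http://www.example.com/articles/growing.html",
]

# Only URLs without any excluded substring would be passed on to the indexer.
kept = [u for u in urls if should_index(u, excluded)]
print(kept)
```

A plain substring test like this is blunt: it drops every URL that carries a session ID rather than stripping the ID and indexing the page. That was fine in my case because the store has its own search.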
I want my tables to be clean, and I’m not fully familiar with Sphider’s reindexing process, so I emptied the tables before indexing again with my added limitations. I did that with an SQL script that looks like this:
DELETE FROM sites;
DELETE FROM links;
DELETE FROM keywords;
DELETE FROM link_keyword0;
DELETE FROM link_keyword1;
DELETE FROM link_keyword2;
DELETE FROM link_keyword3;
DELETE FROM link_keyword4;
DELETE FROM link_keyword5;
DELETE FROM link_keyword6;
DELETE FROM link_keyword7;
DELETE FROM link_keyword8;
DELETE FROM link_keyword9;
DELETE FROM link_keyworda;
DELETE FROM link_keywordb;
DELETE FROM link_keywordc;
DELETE FROM link_keywordd;
DELETE FROM link_keyworde;
DELETE FROM link_keywordf;
DELETE FROM categories;
DELETE FROM site_category;
DELETE FROM temp;
DELETE FROM pending;
DELETE FROM query_log;
DELETE FROM domains;
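Most of that script is the sixteen link_keyword tables, one per hexadecimal digit. If you installed Sphider with a table prefix (the `$mysql_table_prefix` seen in the install script), every table name changes, so it can be easier to generate the statements than to type them. The following Python sketch does that. The function name `clear_statements` is my own, and the empty default prefix is an assumption.

```python
import string


def clear_statements(prefix=""):
    """Build one DELETE statement per Sphider table, in the order used above."""
    # link_keyword tables are suffixed 0-9 and a-f.
    suffixes = string.digits + "abcdef"
    tables = ["sites", "links", "keywords"]
    tables += [f"link_keyword{s}" for s in suffixes]
    tables += ["categories", "site_category", "temp", "pending", "query_log", "domains"]
    return [f"DELETE FROM {prefix}{table};" for table in tables]


# Print the script so it can be pasted into a SQL client.
for statement in clear_statements():
    print(statement)
```

The sketch only prints the statements and never touches the database, so you can review the output before running it.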
After re-indexing, Sphider was returning the results I expected. It was not, however, indexing PDF files. My old engine didn’t either, but Sphider offers the possibility through separate PDF-to-text software. I posted a ticket with my website host, and within 10 minutes they advised me that they already had pdftotext installed and gave me the path to it. I entered that path in sphider/settings/conf.php, set index_pdf to 1 and reindexed the site, and all the PDFs were indexed as well. If your website host can’t give you this level of service, contact me and I’ll move you to a new host that can.
I also turned off the indexing of meta keywords. Many of these were not page-specific and generated irrelevant results during a search.
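Going back to the PDF step: if you want to confirm that the converter works at the path your host gave you before reindexing, a small script can call it directly. This is a Python sketch of such a check, not how Sphider invokes the tool internally. The path and file name in the usage comment are hypothetical. It relies on pdftotext writing to standard output when the output file is given as "-".

```python
import subprocess


def pdftotext_command(converter_path, pdf_path):
    """Build the command line; "-" tells pdftotext to write the text to stdout."""
    return [converter_path, pdf_path, "-"]


def extract_text(converter_path, pdf_path):
    """Run the converter and return the extracted text (raises if it fails)."""
    result = subprocess.run(
        pdftotext_command(converter_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example usage (hypothetical path and file name):
# print(extract_text("/usr/bin/pdftotext", "catalog.pdf")[:200])
```

If the text comes back empty for a PDF that clearly has content, the file is probably image-based and there is nothing for the indexer to pick up.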
Styling the Search Page
The default styling is pretty basic and will not look much like the rest of your website. As you can see, I have modified the HTML so the page fits with the rest of my site. That was pretty easy to do, since the developer of Sphider uses a template system. search.php just pulls in the parts for a header, the search form and a footer and places them properly. I simply went to templates/standard/header.html and templates/standard/footer.html and added my standard header and footer code. I also made a couple of modifications in search.php to hardwire my modified header and footer into the system:
- Replace include "$template_dir/$template/header.html"; with include 'templates/standard/header.html';
- Replace include "$template_dir/$template/footer.html"; with include 'templates/standard/footer.html';
Sphider offers a number of other possibilities that I did not explore.
- You can tell Sphider to index a variety of websites. This might be handy if you wanted to create a system to search for information on cars, for example. You could index car sites from around the web, then assign them to categories by make so that your users can more easily find the information they want.
- You can index MS Word, PowerPoint and Excel documents if you have a need.
- You can configure Sphider for more complex searches using AND and OR.
- Sphider provides statistics about your site, such as: the most commonly found keywords, so you can exclude them from future indexing if they aren’t helpful; the largest pages on your site (in KB); the most popular searches, so you can add content to help your visitors; a search log, so you can see when people were searching for different things and how much your search engine is getting used; and a list of the size of each of your database tables.
- Sphider lists all the URLs it scans during indexing. Sometimes they will be dead links, and Sphider will indicate that by saying that they have fewer than ten words on them. Image-based PDFs will give that same indication.
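The “fewer than ten words” check in the last item is easy to picture. Here is a minimal Python sketch of a check of that kind. I don’t know exactly how Sphider counts words, so the word-splitting rule here is an assumption, and the function name `has_too_few_words` is made up.

```python
import re


def has_too_few_words(text, minimum=10):
    """Return True if the text contains fewer than `minimum` words."""
    # Assumption: a "word" is any run of letters, digits or underscores.
    words = re.findall(r"\w+", text)
    return len(words) < minimum


# A typical error page has almost no text, so it falls under the threshold.
print(has_too_few_words("404 Not Found"))
```

An image-based PDF trips the same check for a different reason: the converter finds no text in it at all.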
All in all, it’s a pretty good system. The official documentation is very sparse, but there is a forum that may be able to help if you have questions. I generally Google “sphider” plus whatever my issue is. Since there don’t seem to be any other Sphiders out there, I get right to the help I need.