Release date: November 01, 2012
Built on Sphider: v.1.3.5
Compared with version 2.8, the following modifications have been added:
New feature:
Support for non-ASCII URLs, using 'Internationalized Domain Names' (IDN)
as defined in RFC 3490, RFC 3491 and RFC 3492.
If activated, internationalized domain names like 'http://президент.рф/' and 'http://müller.de/' will be accepted as new sites in Admin backend, as well as in the User's addurl form.
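As an illustration of what accepting an IDN involves, the following minimal Python sketch converts the host part of a URL into its ASCII form. It is not the Sphider-plus implementation (which is written in PHP); the function name to_ascii_url is hypothetical, and Python's built-in "idna" codec (IDNA 2003, RFC 3490) is assumed to be an adequate stand-in.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Return the URL with its host converted to the ASCII (Punycode) form.

    Hypothetical helper for illustration; user info in the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The "idna" codec applies nameprep and Punycode to each label of the host.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    for site in ("http://müller.de/", "http://президент.рф/"):
        print(site, "->", to_ascii_url(site))
```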
New feature:
Support for Punycode URLs like http://xn--90aoqlh7c4a.xn--d1abbgf6aiiy.xn--p1ai/
Will be converted into the readable form http://события.президент.рф/
To be activated in Admin settings.
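The conversion from Punycode into the readable form can be sketched in the same way. This is only an illustration in Python under the assumption that the built-in "idna" codec is sufficient; to_readable_url is a hypothetical name and not part of Sphider-plus.

```python
from urllib.parse import urlsplit, urlunsplit


def to_readable_url(url: str) -> str:
    """Return the URL with a Punycode host decoded into Unicode.

    Hypothetical helper for illustration; user info in the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Labels starting with "xn--" are decoded, plain ASCII labels pass through.
    readable_host = host.encode("ascii").decode("idna")
    netloc = readable_host if parts.port is None else f"{readable_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_readable_url("http://xn--mller-kva.de/"))
```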
New feature:
Besides the usual HTML elements <element>, also delete from the full text all those HTML elements which are written in entity-coded form, like &lt;element&gt;
To be activated in Admin settings.
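A minimal sketch of such a clean-up, written in Python for illustration only. The regular expression and the function name strip_elements are assumptions; the actual rules used by Sphider-plus may differ.

```python
import re

# Matches a tag written either literally (<b>) or entity-coded (&lt;b&gt;).
ELEMENT_RE = re.compile(r"(?:<|&lt;)/?[A-Za-z][^<>]*?(?:>|&gt;)")


def strip_elements(text: str) -> str:
    """Remove literal and entity-coded HTML elements and tidy whitespace."""
    without_tags = ELEMENT_RE.sub("", text)
    return " ".join(without_tags.split())


if __name__ == "__main__":
    print(strip_elements("a &lt;b&gt;bold&lt;/b&gt; <i>x</i>"))
```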
New feature:
Index only parts of a page, defined by <element > . . . </element>
This feature is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc.
If enabled in Admin settings, the values as defined in the list-file
.../include/common/elements_use.txt will be used to index only the page content between
<element> . . . </element>
For details see chapter Indexing only parts of a page defined by <element> . . . . </element>
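The idea can be sketched with Python's standard HTML parser. This is an illustration, not the Sphider-plus code; the element list passed in stands for the entries of elements_use.txt, whose exact file format is not described here and is therefore an assumption.

```python
from html.parser import HTMLParser


class InsideCollector(HTMLParser):
    """Collect only the text found inside the wanted elements."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0      # > 0 while inside at least one wanted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_inside(html_text, wanted):
    parser = InsideCollector(wanted)
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = "<nav>menu</nav><article>Main <b>text</b></article><footer>f</footer>"
    print(text_inside(page, ["article"]))
```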
New feature:
Ignore parts of a page, defined by <element> . . . </element>
This feature is intended to work with the new HTML5 elements like
section, nav, aside, hgroup, article, header, footer, etc.
If enabled in Admin settings, the values as defined in the list-file
.../include/common/elements_not.txt will be used to remove the content between
<element> . . . </element> from the page content.
This is the opposite of 'Index only parts of a page, defined by <element> . . . </element>'
For details see chapter Ignoring parts of a page defined by <element> . . . . </element>
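Ignoring elements is the mirror image: text is kept only while the parser is outside every listed element. The following Python sketch is an illustration under the assumption that the ignore list corresponds to the entries of elements_not.txt; names are hypothetical.

```python
from html.parser import HTMLParser


class OutsideCollector(HTMLParser):
    """Collect the text that is not enclosed by any ignored element."""

    def __init__(self, ignored):
        super().__init__()
        self.ignored = set(ignored)
        self.depth = 0      # > 0 while inside at least one ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignored:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignored and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def text_outside(html_text, ignored):
    parser = OutsideCollector(ignored)
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = "<nav>menu</nav><article>Main <b>text</b></article><footer>f</footer>"
    print(text_outside(page, ["nav", "footer"]))
```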
New feature:
Index only files and documents with defined suffix :
If activated, all pages of the site will be searched for links, but only files with suffixes as defined in the docs list will be indexed.
For details see chapter Index only files and documents with defined suffix
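A short Python sketch of the suffix test, for illustration only. The set of suffixes is a hypothetical stand-in for the docs list, and the function name is an assumption.

```python
import posixpath
from urllib.parse import urlsplit


def has_allowed_suffix(url, allowed):
    """True if the file suffix in the URL path is in the allowed set."""
    path = urlsplit(url).path          # query and fragment are ignored
    suffix = posixpath.splitext(path)[1].lstrip(".").lower()
    return suffix in allowed


if __name__ == "__main__":
    allowed = {"pdf", "doc"}           # hypothetical content of the docs list
    links = [
        "http://example.com/index.html",
        "http://example.com/files/report.PDF?download=1",
    ]
    # All links are still followed; only the matching ones are indexed.
    print([u for u in links if has_allowed_suffix(u, allowed)])
```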
New feature:
1. Perform a WHOIS check for sites waiting for approval in Admin backend.
2. Perform a WHOIS check for suggested URLs directly in the addurl form, so that invalid URLs will automatically be rejected.
For both tests, either a basic list of WHOIS servers for the generic top level domains and some important country codes (supporting 30 suffixes) or an extended list (supporting 155 suffixes) is selectable.
New option to be activated in Admin backend:
Crawler can leave domain during index procedure, but only for canonical links.
Only the canonical link will be indexed, but links found there will be ignored.
New feature:
Obey the 'refresh' meta tags as part of HTML headers.
Now following the redirection and delayed indexing.
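How a 'refresh' meta tag might be evaluated is sketched below in Python. This is an illustration, not the Sphider-plus code; find_refresh and the regular expression for the content attribute are assumptions.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# "5; url=target" with optional quotes around the target; the target is optional.
CONTENT_RE = re.compile(
    r"^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?['\"]?(.*?)['\"]?\s*)?$", re.I
)


class RefreshFinder(HTMLParser):
    """Remember the content attribute of the first refresh meta tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_refresh = (attributes.get("http-equiv") or "").lower() == "refresh"
        if tag == "meta" and is_refresh and self.content is None:
            self.content = attributes.get("content") or ""


def find_refresh(html_text, base_url):
    """Return (delay_seconds, absolute_target_or_None), or None if no tag exists."""
    finder = RefreshFinder()
    finder.feed(html_text)
    if finder.content is None:
        return None
    match = CONTENT_RE.match(finder.content)
    if match is None:
        return None
    target = match.group(2)
    return int(match.group(1)), (urljoin(base_url, target) if target else None)


if __name__ == "__main__":
    page = '<head><meta http-equiv="Refresh" content="5; url=new.html"></head>'
    print(find_refresh(page, "http://example.com/dir/old.html"))
```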
New option:
Support UTF-16 coded sites. Will convert UTF-16 coded sites into UTF-8.
To be activated in Admin settings
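The conversion itself is simple once UTF-16 has been recognized. The Python sketch below assumes that recognition is done by the byte order mark (BOM); how Sphider-plus actually detects UTF-16 is not stated here.

```python
import codecs


def utf16_to_utf8(raw: bytes) -> bytes:
    """Convert BOM-marked UTF-16 data into UTF-8; return other data unchanged.

    Assumption: detection by BOM only. A UTF-32 LE BOM starts with the same
    two bytes, so a real crawler would have to rule that case out first.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order and drops it.
        return raw.decode("utf-16").encode("utf-8")
    return raw


if __name__ == "__main__":
    print(utf16_to_utf8("héllo".encode("utf-16")))
```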
New option:
For the index procedure, always use the standard Firefox HTTP_USER_AGENT string and ignore the individually defined Sphider-plus string. To be activated in Admin backend.
New feature:
Follow redirections, which are invoked by JavaScript, when sent as HTTP content.
Will obey directives like:
<SCRIPT language="javascript">window.location="mp.php?mcv=59"; </SCRIPT>
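Detecting such a directive can be sketched with a regular expression. The Python example below is an illustration only; the pattern covers simple assignments of a quoted string to location and is an assumption about what a crawler would reasonably match.

```python
import re
from urllib.parse import urljoin

# window.location = "...", document.location.href = '...', location = "..."
JS_REDIRECT_RE = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']""",
    re.I,
)


def js_redirect_target(html_text, base_url):
    """Return the absolute redirect target, or None if none is found."""
    match = JS_REDIRECT_RE.search(html_text)
    return urljoin(base_url, match.group(1)) if match else None


if __name__ == "__main__":
    page = '<SCRIPT language="javascript">window.location="mp.php?mcv=59"; </SCRIPT>'
    print(js_redirect_target(page, "http://example.com/shop/index.php"))
```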
New feature:
Follow URL redirections caused by HTTP 301, 302, 303 and 307 status codes.
New feature:
Separated PDF converter supplied for 32 and 64 bit Operating Systems.
For details, please see chapter PDF converter for Linux/UNIX systems
New feature:
Follow links placed in JavaScript files. Will detect and follow links like
document.write(' <a href="new_12.pdf">All news 2012</a> ');
Also the complete content of
document.write(' this text in all rows');
will be indexed and stored as keywords in db.
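A rough Python sketch of how links and text could be pulled out of document.write() calls. It only handles a single quoted string argument and is an illustration under that assumption, not the Sphider-plus parser.

```python
import re
from urllib.parse import urljoin

# document.write('...') or document.write("...") with one quoted argument.
WRITE_RE = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.S)
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.I)
TAG_RE = re.compile(r"<[^>]+>")


def scan_document_write(js_text, base_url):
    """Return (absolute links, plain text) found in document.write() calls."""
    links, words = [], []
    for _quote, content in WRITE_RE.findall(js_text):
        links.extend(urljoin(base_url, href) for href in HREF_RE.findall(content))
        words.extend(TAG_RE.sub(" ", content).split())
    return links, " ".join(words)


if __name__ == "__main__":
    js = """document.write(' <a href="new_12.pdf">All news 2012</a> ');"""
    print(scan_document_write(js, "http://example.com/news/index.html"))
```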
New feature:
Now also indexing sites which send an obligatory request for a cookie, to be set by the crawler.
New feature:
In order to reduce transmission time, the crawler now requests gzip-formatted data transfer from the remote server for the URL to be indexed.
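In HTTP terms this means sending an Accept-Encoding request header and unpacking the body when the server answers with Content-Encoding: gzip. The Python sketch below shows only the unpacking side; the function name decode_body is hypothetical.

```python
import gzip

# Header a crawler would add to its request to ask for compressed transfer.
REQUEST_HEADERS = {"Accept-Encoding": "gzip"}


def decode_body(body: bytes, content_encoding) -> bytes:
    """Unpack the response body if the server reports gzip encoding."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    packed = gzip.compress(b"<html>example</html>")
    print(len(packed), decode_body(packed, "gzip"))
```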
New option:
In order to convert the text into UTF-8, use the charset definition as supplied via HTTP by the remote server.
If this option is not activated in Admin Settings, the charset will be extracted from the header of the files to be indexed. If not found, like in PDF documents, the preferred charset will be used.
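The selection order described above can be sketched as follows. This Python example is an illustration; the behaviour when the option is active but the HTTP header carries no charset is not specified here, so falling through to the file header in that case is an assumption.

```python
import re

CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.I)


def pick_charset(http_content_type, body, preferred="utf-8", use_http=True):
    """Choose the charset used for converting the body into UTF-8."""
    if use_http and http_content_type:
        match = CHARSET_RE.search(http_content_type)
        if match:
            return match.group(1).lower()
    # Otherwise look into the first part of the file itself (e.g. a meta tag).
    match = CHARSET_RE.search(body[:2048].decode("ascii", "ignore"))
    if match:
        return match.group(1).lower()
    # Nothing found, as is typical for PDF documents: use the preferred charset.
    return preferred


if __name__ == "__main__":
    html_body = b"<html><head><meta charset='utf-8'></head></html>"
    print(pick_charset("text/html; charset=ISO-8859-1", html_body))
    print(pick_charset("text/html; charset=ISO-8859-1", html_body, use_http=False))
```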
New option:
Delete duplicate parts of the URL path found in the indexed page URL and the new links.
Unfortunately some CMS seem to be unable to build up a correct path for relative links.
If activated in Admin backend, these duplicate parts of the path will be deleted from the link URL. Should be activated only if the sites to be indexed were created by such a CMS.
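One possible way to remove such duplicates is to collapse any run of path segments that is immediately repeated. The Python sketch below is an assumption about the approach, not the Sphider-plus algorithm. It also shows why the option should stay off for ordinary sites: a legitimate path like /a/a/ would be shortened as well.

```python
def drop_repeated_segments(path: str) -> str:
    """Collapse directly repeated runs of path segments.

    Example: "/shop/items/shop/items/a.html" becomes "/shop/items/a.html".
    """
    segments = path.split("/")
    changed = True
    while changed:
        changed = False
        # Try the longest possible repeated run first.
        for run in range(len(segments) // 2, 0, -1):
            for start in range(len(segments) - 2 * run + 1):
                first = segments[start:start + run]
                second = segments[start + run:start + 2 * run]
                # any(first) skips runs made only of empty segments ("//").
                if first == second and any(first):
                    del segments[start + run:start + 2 * run]
                    changed = True
                    break
            if changed:
                break
    return "/".join(segments)


if __name__ == "__main__":
    print(drop_repeated_segments("/shop/items/shop/items/a.html"))
```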
New feature:
Show summary of the currently active User database at the bottom of the result listing.
To be activated in Admin backend. The counts of sites, categories, page links and keywords are displayed.
New feature:
Automatically deleting invalid URLs from Admin 'Sites' view.
Improved 'Add site' function in Admin backend.
Now treating URLs with and without 'www' as equal, and excluding them as duplicate sites.
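A duplicate check of this kind can be sketched by reducing each URL to a comparison key. The Python example is an illustration; ignoring the scheme and a trailing slash in the key is an assumption, and site_key is a hypothetical name.

```python
from urllib.parse import urlsplit


def site_key(url: str) -> str:
    """Build a comparison key that treats 'www.' and bare hosts as equal."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path


def is_duplicate(new_url, known_urls):
    return site_key(new_url) in {site_key(u) for u in known_urls}


if __name__ == "__main__":
    known = ["http://www.example.com/"]
    print(is_duplicate("http://example.com", known))
```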
Improved image indexing procedure
Now also indexing phpBB images, linked by php command files.
New option:
Suppress the file suffix from image file names for indexing.
Improved media indexing procedure
If the title tag is missing, the alt tag is now used to define the name of the media. If the alt tag is also missing, the file name will be used as keyword.
Improved "banned domain" management
Now holding name and suffix of the banned domains, and no longer the URLs.
Improved index procedure
Now ignoring links that try to link to the calling URI (self back linking).
Improved link detection for relative links, which are to be found in full text.
Improved input protection against SQL injections
Improved Admin statistics
Now providing also the IP, country code and country name for
- Search log
- Most popular searches
- Most popular page links
- Most popular media links
Updated GeoIP database, used to provide the IP, CC and country name for the Admin statistics. Now also supporting IPv6 URLs.
Support for ppt files temporarily removed on Windows systems, as the converter causes failures on large PowerPoint documents.
Bug fixed, which prevented category selection without activating the "Advanced search form" option.
Bug fixed that caused invalid URL encoding in result listing.
Bug fixed causing the error output "Unknown column 'naame' in field list" during media indexing.
Bug fixed that caused MySQL warning messages during the index procedure with some older MySQL versions, if the URL to be indexed contained blank characters.
Bug fixed, which caused invalid URL creation for relative links containing a file name and/or query.
Bug fixed in option 'Crawler can leave domain'.
Bug fixed in option 'Use list of div ids to ignore the div content during index/re-index'.
Bug fixed in option 'Enable to decode entity coded sites into standard HTML characters'.
Bug fixed in 'addurl' form, which prevented input of words containing accents in 'title' and 'description' fields.
Some additional minor bugs fixed.
Involved files that have been modified / added for this release:
.../addurl.php
.../admin/admin.php
.../admin/admin_header.php
.../admin/admin_search.php
.../admin/auth.php
.../admin/auth_bypass.php
.../admin/auth_db.php
.../admin/configset.php
.../admin/db_activate.php
.../admin/db_config.php
.../admin/db_main.php
.../admin/geoip.php
.../admin/GeoIP.dat
.../admin/http.php
.../admin/index_media.php
.../admin/install_tables.php
.../admin/messages.php
.../admin/spider.php
.../admin/spiderfuncs.php
.../admin/url_backup.php
.../converter/feed_parser.php
.../converter/pdftotext32.script
.../converter/pdftotext64.script
.../include/click_counter.php
.../include/commonfuncs.php
.../include/domain_whois.php
.../include/idna_converter.php
.../include/media_counter.php
.../include/search_10.php
.../include/search_40.php
.../include/search_50.php
.../include/search_media.php
.../include/searchfuncs.php
.../include/suggest.php
.../include/common/docs.txt
.../languages/ all files
.../templates/html/020_search-form.html
.../templates/html/090_footer.html
.../templates/html/091_footer.html
Attention: This version requires an updated set of database tables. It is strongly recommended to follow the instructions as described in chapter: "Updating from 2.x to 2.y" for version 2.9