Version 1.0
-
A graphical user interface is introduced, providing an environment for easier
configuration development and testing.
-
The html-to-xml processor, which is based on HtmlCleaner, now exposes attributes
for controlling the cleaner's behaviour.
-
Besides the BeanShell scripting engine, two others are added: Groovy and JavaScript.
It is now possible to choose a favourite scripting engine or even mix them in a single
Web-Harvest configuration. This option is supported by new attributes on the
config, script and template processors.
-
Access to the HTTP client is supported through a new implicit context variable, http.
It is now possible to check important HTTP response values, such as
http.mimeType, http.headers and http.statusCode, or even to obtain the instance of the
org.apache.commons.httpclient.HttpClient class with http.client and manipulate it
at runtime.
-
A new attribute, cookie-policy, is added to the http processor,
specifying the way HttpClient manages cookies.
-
Command-line use is improved by adding several new parameters.
-
For more comfortable use of Web-Harvest context variables in the script engines'
runtime scopes, several handy methods are added to the class
org.webharvest.runtime.variables.Variable (the interface IVariable in
previous versions of Web-Harvest).
-
Several useful methods are added to the implicit Web-Harvest context variable sys,
such as sys.xpath(expression, xml), sys.isVariableDefined(varname)
and sys.defineVariable(varName, varValue, [overwrite]).
-
The overwrite attribute is added to the var-def processor,
making it possible to specify whether an existing variable with the specified name
will be overwritten.
-
A new processor, <exit condition=... message=.../>, is introduced
to support conditional termination of execution.
-
Encoding selection in the http processor is changed: if no encoding is explicitly
specified with the charset attribute, the one given in the HTTP response is used
instead to read the downloaded text content.
-
NTLM proxy authentication scheme is supported.
-
Performance improvements and bug fixes.
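
The encoding selection described above for the http processor can be illustrated
with a small, standalone sketch. This is not Web-Harvest code: the class name
CharsetChoice, both method names, and the extra fallback argument (used when the
response carries no charset either) are hypothetical and exist only for
illustration. The sketch assumes the response encoding arrives as the charset
parameter of the Content-Type header.

```java
import java.util.Locale;

// Hypothetical helper, not part of Web-Harvest.
final class CharsetChoice {

    // Returns the charset parameter of a Content-Type header value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // An explicit charset wins; otherwise the charset from the response is used.
    // The fallback argument is an assumption of this sketch for the case where
    // neither source provides a charset.
    static String choose(String explicitCharset, String contentType, String fallback) {
        if (explicitCharset != null && !explicitCharset.trim().isEmpty()) {
            return explicitCharset.trim();
        }
        String fromResponse = charsetFromContentType(contentType);
        return fromResponse != null ? fromResponse : fallback;
    }

    public static void main(String[] args) {
        System.out.println(choose("ISO-8859-2", "text/html; charset=UTF-8", "UTF-8"));
        System.out.println(choose(null, "text/html; charset=windows-1250", "UTF-8"));
    }
}
```

With an explicit charset the header is ignored; without one, the value parsed
from the header decides how the downloaded text is read.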
Version 0.5
-
The html-to-xml parser is changed: HtmlCleaner is used instead of TagSoup. The
downside is that some existing Web-Harvest configurations may need
corrections to their XPath or XQuery processors. On the other hand, a lot of
previously existing problems are now solved.
-
The script processor is introduced. It adds scripting support based on
the BeanShell scripting language.
-
The template processor is now also based on BeanShell instead of
OGNL, making it possible to share the same variables and methods
with script processing.
-
An optional type attribute is now added to xq-param, defaulting to node().
It specifies the type of the external XQuery parameter. Before Web-Harvest 0.5
this parameter was implicitly declared at the beginning of the XQuery expression
and was always of type node()*. From now on, each parameter defined
with xq-param requires a matching explicit declaration inside xq-expression
(declare variable $var_name as var_type external;).
-
A couple of new constructors are added to the class ScraperConfiguration,
allowing a configuration to be loaded from a URL or from an arbitrary input stream.
-
The file and include processors now support both absolute
and relative paths. File paths are regarded as absolute if they begin with
X:, /, or \, where X is a letter.
-
To avoid ambiguity when exchanging values with script and template
processing, Web-Harvest variables are case-sensitive as of this version.
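
The absolute-path rule stated above for the file and include processors (a path
is absolute if it begins with X:, /, or \, where X is a letter) can be expressed
as a short check. This is a minimal sketch, not the Web-Harvest implementation:
the class PathRules and the method isAbsolute are hypothetical names, and
treating "letter" as an ASCII letter is an assumption of the sketch.

```java
// Hypothetical helper, not part of Web-Harvest.
final class PathRules {

    // True if the path starts with "/", "\", or a drive prefix such as "C:".
    static boolean isAbsolute(String path) {
        if (path == null || path.isEmpty()) {
            return false;
        }
        char first = path.charAt(0);
        if (first == '/' || first == '\\') {
            return true;
        }
        // Assumption: only ASCII letters count as drive letters.
        boolean asciiLetter = (first >= 'A' && first <= 'Z') || (first >= 'a' && first <= 'z');
        return asciiLetter && path.length() >= 2 && path.charAt(1) == ':';
    }

    public static void main(String[] args) {
        String[] samples = {"C:/data/in.xml", "/etc/config.xml", "\\share\\in.xml", "conf/in.xml"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + (isAbsolute(sample) ? "absolute" : "relative"));
        }
    }
}
```

Any path that fails this check would be handled as relative; what a relative
path is resolved against is not stated here, so the sketch leaves that out.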