
Manual

The jARVEST transformer model

Every harvester built with jARVEST is composed of a set of transformers. A transformer is a component that receives a stream of strings (Java String) and outputs a stream of strings.

In addition, concrete transformers can receive parameters. For example, the xpath transformer (see available transformers) must be configured with the XPath expression to evaluate against the incoming input strings (treated as HTML documents).

Each transformer can have children, which are also transformers. How strings flow to the children depends on the parent's connection policy, either cascade (serial) or branched. This policy is a parameter common to every transformer.

Basic syntax of one transformer

Example:

Cascade (or serial) connection

The parent can treat its children in a serial fashion: the parent's outputs are fed to the first child, the outputs of the first child are fed to the second, and so on.

Let's see an example. The following harvester:

is the same as this one:

The two transformers (wget and xpath) that compose the harvester are connected in cascade mode, because a hidden parent (added by default) is configured to connect its children in cascade. So this is also the same harvester:

The pipe transformer is a simple transformer that does nothing with its inputs; it only forwards them to its children in cascade mode.
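The cascade behaviour described above can be modelled in a few lines of plain Python. This is only a sketch of the data flow, not jARVEST syntax: the names cascade, fake_fetch and extract_hrefs are hypothetical, and fake_fetch stands in for wget without performing any network access.

```python
import re
from typing import Callable, List

# A transformer is modelled as a function from a list of strings
# to a list of strings.
Transformer = Callable[[List[str]], List[str]]


def cascade(*children: Transformer) -> Transformer:
    """Connect children serially: each child's outputs feed the next child."""
    def run(inputs: List[str]) -> List[str]:
        stream = list(inputs)
        for child in children:
            stream = child(stream)
        return stream
    return run


def fake_fetch(inputs: List[str]) -> List[str]:
    """Hypothetical stand-in for wget: builds a tiny HTML page per URL."""
    return [f"<html><a href='{url}/next'>next</a></html>" for url in inputs]


def extract_hrefs(inputs: List[str]) -> List[str]:
    """Hypothetical stand-in for xpath: one output per href found."""
    return [href for doc in inputs for href in re.findall(r"href='([^']*)'", doc)]


harvester = cascade(fake_fetch, extract_hrefs)
print(harvester(["http://example.org"]))  # ['http://example.org/next']
```

A cascade with no children simply forwards its inputs, which mirrors what the pipe transformer is described as doing on its own.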

Branched connection

Any transformer can also branch its output among the children. There are two branch modes:

  • Feed all children with its output, known as the BRANCH_DUPLICATED branch mode.
  • "Scatter" its output among the children, known as BRANCH_SCATTERED. The first output of the parent goes to the first child, the second output goes to the second, and so on. If there are more outputs than children (n), output n+1 goes to the first child again, and the cycle repeats.
Once the outputs are forwarded to the children, every child will produce outputs. How they are merged is known as the merge mode, which could be:
  • All the outputs of the first child, next the outputs of the second, and so on, known as ORDERED merge mode.
  • The first output of every child, next the second output of every child, and so on, known as SCATTERED merge mode.
  • All the outputs SCATTERED and concatenated into a single output, known as COLLAPSED. Collapsing the output can also be done with the merge transformer.
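To make the two branch modes and three merge modes concrete, here is a small Python model of them. It is a sketch under stated assumptions, not jARVEST code: the function names distribute and merge are hypothetical, COLLAPSED is assumed to concatenate with no separator, and children that produce fewer outputs than their siblings are assumed to be skipped once exhausted.

```python
from itertools import zip_longest

_MISSING = object()  # marks "this child has no more outputs"


def distribute(outputs, n_children, branch_mode):
    """Return one input list per child, according to the branch mode."""
    if branch_mode == "BRANCH_DUPLICATED":
        # Every child receives every output of the parent.
        return [list(outputs) for _ in range(n_children)]
    if branch_mode == "BRANCH_SCATTERED":
        # Round-robin: output i goes to child i % n_children.
        return [list(outputs[i::n_children]) for i in range(n_children)]
    raise ValueError(f"unknown branch mode: {branch_mode}")


def merge(child_outputs, merge_mode):
    """Combine the per-child output lists according to the merge mode."""
    if merge_mode == "ORDERED":
        # All outputs of child 0, then all outputs of child 1, and so on.
        return [s for outs in child_outputs for s in outs]
    # First output of every child, then second output of every child, ...
    interleaved = [
        s
        for row in zip_longest(*child_outputs, fillvalue=_MISSING)
        for s in row
        if s is not _MISSING
    ]
    if merge_mode == "SCATTERED":
        return interleaved
    if merge_mode == "COLLAPSED":
        # Assumption: plain concatenation, no separator.
        return ["".join(interleaved)]
    raise ValueError(f"unknown merge mode: {merge_mode}")


parent_outputs = ["a", "b", "c"]
print(distribute(parent_outputs, 2, "BRANCH_DUPLICATED"))  # [['a', 'b', 'c'], ['a', 'b', 'c']]
print(distribute(parent_outputs, 2, "BRANCH_SCATTERED"))   # [['a', 'c'], ['b']]

child_outputs = [["h1", "h2"], ["t1", "t2"]]
print(merge(child_outputs, "ORDERED"))    # ['h1', 'h2', 't1', 't2']
print(merge(child_outputs, "SCATTERED"))  # ['h1', 't1', 'h2', 't2']
print(merge(child_outputs, "COLLAPSED"))  # ['h1t1h2t2']
```

In this model, BRANCH_DUPLICATED combined with SCATTERED yields pairs of consecutive outputs, one from each child, for every matched item.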

The following harvester retrieves all the links (href) and their text from each input URL, generating two consecutive outputs per link. This is done by branching the output of the previous child (wget) among its children:

Please note that the branching parameters are common to every transformer, so this harvester could also be rewritten as:

Loops

Every transformer can be executed in loop mode. In this mode, the transformer includes a specific child (known as the loop controller) that receives the parent's output. If the loop controller returns some output, that output is fed back to the parent transformer as new input. If it returns no output, the loop ends.
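The feedback cycle just described can be sketched in Python as follows. This models the idea only and is not jARVEST code: run_loop, fetch and next_links are hypothetical names, the pages are an in-memory dictionary rather than real web pages, the max_rounds guard is an addition for safety, and it is an assumption that the overall result is the accumulated output of every round.

```python
import re

# Hypothetical paginated "site": each page may point to the next one.
PAGES = {
    "p1": "results 1-10;next=p2",
    "p2": "results 11-20;next=p3",
    "p3": "results 21-25",
}


def fetch(urls):
    """Stand-in for the looping transformer: returns the page contents."""
    return [PAGES[u] for u in urls]


def next_links(docs):
    """Stand-in for the loop controller: outputs the next page, if any."""
    return [link for doc in docs for link in re.findall(r"next=(\w+)", doc)]


def run_loop(transformer, controller, inputs, max_rounds=100):
    """Run transformer; feed controller outputs back until it outputs nothing."""
    collected = []
    current = list(inputs)
    for _ in range(max_rounds):
        outputs = transformer(current)
        collected.extend(outputs)
        current = controller(outputs)
        if not current:  # no output from the controller: the loop ends
            break
    return collected


print(run_loop(fetch, next_links, ["p1"]))
# ['results 1-10;next=p2', 'results 11-20;next=p3', 'results 21-25']
```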

To configure a given transformer in loop mode, you should call the repeat? method at the end of the transformer:

Loop mode is useful, for example, to iterate over paginated results in web pages. For example:

Parameters with input auto-references

Every parameter value can include a special wildcard: %%n%% (n is a number >= 0). This represents "the value of input n". The first input is 0. For example, the following harvester:

will compare each input (except the first) against the first input.
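As an illustration of how such a wildcard could be resolved, the following Python sketch replaces every %%n%% in a parameter value with the n-th input seen so far. The function name expand_auto_references is hypothetical, and leaving an out-of-range reference untouched is an assumption; the text above does not say how jARVEST handles that case.

```python
import re


def expand_auto_references(param, inputs):
    """Replace each %%n%% in param with inputs[n] (zero-based)."""
    def substitute(match):
        index = int(match.group(1))
        if index < len(inputs):
            return inputs[index]
        # Assumption: unknown references are left as they are.
        return match.group(0)
    return re.sub(r"%%(\d+)%%", substitute, param)


inputs = ["10", "7", "12"]
print(expand_auto_references("%%0%%", inputs))              # 10
print(expand_auto_references("first=%%0%% third=%%2%%", inputs))  # first=10 third=12
print(expand_auto_references("%%9%%", inputs))              # %%9%%
```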

Variables

You can define global variables with the setvar transformer. It is useful to save values at any time (including the value arriving as an input) and retrieve them in the future as any parameter of any other transformer. Example:

Filtering inputs

You can filter out some inputs in any transformer. These inputs are ignored and consumed, so they are not passed to the next transformers. This is done with the inputFilter parameter, which receives a string indicating which inputs should be ignored (example 1: 1,5,6; example 2: 0-10; example 3: 3-).
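One possible reading of the three filter strings above is sketched below in Python. Several points are assumptions rather than documented behaviour: input positions are taken to be zero-based (as with %%n%%), ranges are taken to be inclusive, an open range such as 3- is taken to mean "3 and every later input", and commas and ranges are allowed to mix. The names parse_input_filter and apply_input_filter are hypothetical.

```python
def parse_input_filter(spec):
    """Return a predicate telling whether the input at position i is ignored."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # An empty end ("3-") means the range is open-ended.
            ranges.append((int(start), int(end) if end else None))
        else:
            ranges.append((int(part), int(part)))

    def ignored(i):
        return any(lo <= i and (hi is None or i <= hi) for lo, hi in ranges)

    return ignored


def apply_input_filter(spec, inputs):
    """Drop the inputs selected by spec and keep the rest, in order."""
    ignored = parse_input_filter(spec)
    return [s for i, s in enumerate(inputs) if not ignored(i)]


letters = list("abcdefgh")
print(apply_input_filter("1,5,6", letters))  # ['a', 'c', 'd', 'e', 'h']
print(apply_input_filter("0-2", letters))    # ['d', 'e', 'f', 'g', 'h']
print(apply_input_filter("3-", letters))     # ['a', 'b', 'c']
```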

Available transformers

Transformer: Description / Parameters
wget For each input string 's', performs an HTTP GET request to the URL 's' and returns its contents as a new output.
userAgent (string)
The user agent (default: jARVEST's default user agent)
ajax (true | false)
Add the "X-Requested-With: XMLHttpRequest" header to the HTTP request. Some ajax servers only respond correctly if this header is present (default: false)
headers (string)
Additional headers for the request, specified in JSON format. Example: {"Accept-Charset": "iso-8859-5, unicode-1-1;q=0.8"} (default: none)
binary (true | false)
Return the output of the url as a base64-encoded string. It is useful to safely download binary contents as strings. Note: you have to decode the base64 string. (default: none)
xpath For each input string 's', treats it as HTML by building its DOM tree and runs a given XPath expression. Each matched content is returned as a new output.
XPath (string)
An XPath expression (required)
ifNoMatch (string)
Output a given string if there is no match in the input 's' (default: no output)
addTBody (true | false)
Add the tbody tag inside tables when parsing HTML (default: true)
htmlClean (true | false)
Clean the HTML before parsing (corrects syntax). If you are sure that your input is a valid HTML or XML document, set this to false (default: true)
xpathscrap For each input string 's', treats it as HTML by building its DOM tree and runs a given XPath expression. The whole inner XML of each matched content is returned.
XPath (string)
An XPath expression
ifNoMatch (string)
Output a given string if there is no match in the input 's' (default: no output)
addTBody (true | false)
Add the tbody tag inside tables when parsing HTML (default: true)
htmlClean (true | false)
Clean the HTML before parsing (corrects syntax). If you are sure that your input is a valid HTML or XML document, set this to false (default: true)
select For each input string 's', treats it as HTML and selects nodes with a given CSS selector expression. For each matched node, it can return a) the combined inner text (default), b) a specified attribute, or c) the inner HTML.
selector (string)
A CSS selector expression. Example: "table#results td"
ifNoMatch (string)
Output a given string if there is no match in the input 's' (default: no output)
attribute (string)
Output the specified attribute, instead of inner text (default: none)
innerHTML (true | false)
Output the innerHTML, instead of the inner text (default: false)
decorate For each input string 's', generate a new output by prepending a 'head' and appending a 'tail'.
head (string)
A string to prepend (default: none)
tail (string)
A string to append (default: none)
match For each input string 's', matches a regular expression containing exactly one capture group (in parentheses). Each captured result is returned as a new output.
pattern (string)
A regular expression (required)
ifNoMatch (string)
Output a given string if there is no match in the input 's' (default: no output)
append Returns all input strings (if any) as outputs, followed by one additional output string at the end.
append (string)
The string to append (default: "")
replace For each input string 's', generate a new output by replacing each match of a regular expression with a given string.
sourceRE (string)
regular expression (required)
dest (string)
replacement string (required)
compare For each input string 's', compares it with a given value 'v' as a String, Date, or Number, and generates a new output by adding a prefix to the input. The prefix depends on whether 's' is less than, equal to, or greater than 'v', or whether the comparison produced an error.
compareWith (string)
the value to compare to
compareAs ("Date" | "String" | "Number")
Perform the comparison treating values as the given data type
prefixIfGreater (string)
Prefix if 's' is greater than 'v'. Default: "_GREATER_"
prefixIfLess (string)
Prefix if 's' is less than 'v'. Default: "_LESS_"
prefixIfEquals (string)
Prefix if 's' is equal to 'v'. Default: "_EQUALS_"
prefixIfError (string)
Prefix if there is an error when comparing 's' and 'v'. Default: "_ERROR_"
merge Collapses all inputs as a single output.
post Performs an HTTP POST request to a URL given as a parameter. The output of this transformer can be (i) the input strings with no transformation or (ii) the server's response as a single output (inputs are ignored).
Note: All returned cookies are kept for the rest of the harvester execution (including further wget/post requests). In other words, you can use this transformer to log in to sites that use cookie-based sessions.
URL (string)
The URL to perform the POST (required)
queryString (string)
The query string sent in the body of the POST request. Example: "user=foo&pass=bar" (required)
querySeparator (string)
Separator of each request parameter (default: "&")
outputHTTPOutputs (true | false)
Give the server response body as output (default: false)
userAgent (string)
The user agent (default: jARVEST's default user agent)
ajax (true | false)
Add the "X-Requested-With: XMLHttpRequest" header. Some ajax servers only respond correctly if this header is present (default: false)
headers (string)
Additional headers for the request, specified in JSON format. Example: {"Accept-Charset": "iso-8859-5, unicode-1-1;q=0.8"} (default: none)
binary (true | false)
Return the output of the url as a base64-encoded string. It is useful to safely download binary contents as strings. Note: you have to decode the base64 string. (default: none)
pipe A simple transformer (does not transform the data), useful for grouping a set of children in a serial (cascade) connection.
branch A simple transformer (does not transform the data), useful for grouping a set of children in a branched connection.
first parameter
The branch mode: BRANCH_DUPLICATED or BRANCH_SCATTERED (required)
second parameter
The merge mode: SCATTERED or COLLAPSED or ORDERED (required)
Example:
one_to_one Treats each input of the parent independently and ensures at most one output per input. That is, each output of the parent is forwarded to the child block one at a time, and the child block's outputs are collapsed into a single output before the parent's next output is forwarded.

For example, suppose we have multiple input sites and want to run an XPath query over each one. Each query could return more than one output, so to keep the correspondence between each input URL and its query results, we must use the one_to_one approach.

Example:
setvar Defines a "global variable" with a given name and value. The variable can be retrieved afterwards with %%varname%%. This transformer does not modify the inputs, it only forwards them.
name
The name of the global variable
value
The value of the global variable. Hint: You can also use %%number%% to put a desired input as a variable (see Parameters with input auto-references)
Example.
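As a rough model of what setvar is described as doing, the Python sketch below stores a value under a name, forwards its inputs unchanged, and lets later parameters retrieve the value with %%varname%%. This is plain Python and not jARVEST syntax: make_setvar and expand_variables are hypothetical names, the variable store is a simple dictionary, and leaving unknown references untouched is an assumption.

```python
import re


def make_setvar(variables, name, value):
    """Build a transformer that stores value under name and forwards inputs."""
    def run(inputs):
        def from_input(match):
            index = int(match.group(1))
            # Assumption: out-of-range references are left untouched.
            return inputs[index] if index < len(inputs) else match.group(0)

        # %%number%% in the value refers to the input at that position.
        variables[name] = re.sub(r"%%(\d+)%%", from_input, value)
        return list(inputs)  # inputs are forwarded unmodified
    return run


def expand_variables(variables, param):
    """Replace each %%varname%% in param with the stored variable value."""
    def lookup(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"%%([A-Za-z_]\w*)%%", lookup, param)


variables = {}
save_first = make_setvar(variables, "first", "%%0%%")
print(save_first(["alpha", "beta"]))                       # ['alpha', 'beta']
print(expand_variables(variables, "value=%%first%%"))      # value=alpha
```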