Latest news

  • 2013-02-04. jARVEST-web. We are developing a Web Frontend for jARVEST. Please check it out at GitHub.
    Features:
    • Remote jARVEST Engine. You can maintain several robots and make them publicly available through a simple REST API. This is very useful when developing internet applications where Java is not available.
    • jARVEST mini-IDE. User-friendly environment
    • It is a Java web application backed by a MySQL database (through Hibernate), tested with Tomcat 7. To configure it, edit /src/main/resources/hibernate.cfg.xml and set your database name and a valid username/password.

What is jARVEST?

jARVEST (Java web harvesting library) is a simple web harvesting or web scraping framework. jARVEST is implemented via a powerful JRuby-based domain specific language (DSL), allowing you to develop harvesters with minimum code.

About the project

jARVEST is developed by Daniel Glez-Peña (engine) at the University of Vigo, and Óscar González Fernández (DSL).

Features

  • 100% Java. This eases integration into both client and server projects while preserving platform independence.
  • XPath expressions matching. The HTML pages are automatically "cleaned-up" and transformed to XHTML (thanks to htmlcleaner) before running the XPath matching.
  • CSS selector matching. Select HTML nodes with the CSS selector syntax (thanks to jsoup).
  • Form posting and cookie tracking. This lets you log in to sites and keep the session alive during the harvester execution.
  • Complex robot assembly. By combining primitives with serial and parallel layouts, you can build very complex robots with few lines of code.
  • Variable definitions. You can define global variables to capture any input, which can later be used at any point in the harvester, simplifying the code in many situations.
  • Loops. Harvesters can include loops, useful for iterating over paginated results.
  • Stream-based. The harvester is a transformation routine receiving a stream of strings (Java String) and producing a stream of strings. You can get the outputs as soon as they are produced.
  • Free software. Licensed under the GNU Lesser General Public License, hosted at GitHub.
  • Command-line and API. jARVEST can be run from the command line, which is useful in shell scripts, and as an API embedded in any Java project.
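
To make the XPath matching and stream-of-strings ideas from the list above more concrete, here is a minimal sketch that uses only the JDK's javax.xml.xpath API. It is not jARVEST's implementation: the class name XPathSketch and the method match are hypothetical, and the input is assumed to be already well-formed XHTML (the step the feature list attributes to htmlcleaner is skipped here).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of the jARVEST API.
class XPathSketch {

    // Takes one input string (well-formed XHTML) and returns the text of
    // every node matched by the XPath expression, in document order.
    static List<String> match(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute nodes this is the attribute value.
                results.add(nodes.item(i).getTextContent());
            }
            return results;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "Invalid XPath or unparsable input: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; a real harvester would fetch and clean a page first.
        String xhtml = "<html><body>"
                + "<a href=\"/first\">1</a>"
                + "<p><a href=\"/second\">2</a></p>"
                + "</body></html>";
        for (String href : match(xhtml, "//a/@href")) {
            System.out.println(href);
        }
    }
}
```

The expression //a/@href is the same one used in the command-line example in the next section; one input string goes in and a list of matched strings comes out, which mirrors the string-in, string-out model described above.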

For impatient users

First, download and install jARVEST.

From command line:

echo "http://www.google.com" | ./jarvest.sh run -p "wget | xpath('//a/@href')"

Inside Java: