Darcy Ripper User Manual
Darcy Ripper
www.darcyripper.com
Contents
1 Overview
1.1 About
2.3 Utilities
2.3.1 History
3.4 Cookie Settings
3.6 Basic Rules
2014-08-11
1 Overview
1.1 About
Darcy Ripper is a pure Java, multi-platform web crawler (web spider) designed for heavy workloads
and high download speeds. It is a standalone Graphical User Interface application that both
casual users and programmers can use to download web resources on the fly.
Based on proven Java technology, the intuitive Darcy GUI is easy to use and provides robust
functionality for creating and running simple or complex download jobs.
Multi-platform;
Real-time view of the download job progress;
Statistics reports;
Pause/Resume/Stop download job anytime;
Save and Load download job template files;
Regular Expression Editor;
Check for Updates support;
Online Help and support.
HTTP/HTTPS support;
GZip compression support;
HTTP Proxy support;
WWW Authentication support;
Cookies support;
Request customization support: referral behavior, configurable agent name;
HTTP response code analysis and configurable behavior;
Connection limits support: maximum number of connections per server, retry count
control, bandwidth limitation, limitation depending on the HTTP response code.
GUI Overview
Darcy Ripper offers an intuitive and robust interface that makes it easier to create, load and run
download jobs (Job Packages) in a transparent and secure manner.
Utilities Menu
The Utilities menu includes the following commands:
History
Shows the Job Package activity history;
Regular Expressions Editor
Starts the regular expressions editor dialog.
Help Menu
The Help menu includes the following commands:
Help
Launches the Darcy Ripper Help dialog;
Send Feedback
Opens the default system browser and launches the Darcy Ripper Feedback URL;
Check for Updates...
Checks if there are any Darcy Ripper updates available for download;
About
Provides a few details regarding the current Darcy Ripper application.
Tool bar
The application's tool bar contains the following commands:
New
Creates a new Job Package and launches the Job Package Configuration dialog;
Open...
Opens an existing Job Package file and loads its configuration;
Edit
Opens the Job Package Configuration dialog for the currently selected Job Package;
Save
Saves the currently selected Job Package to a file;
Save As...
Saves the currently selected Job Package to a different file;
Save All
Saves all the opened Job Packages to files;
Start
Starts the download of the currently selected Job Package;
Pause
Pauses the currently running download process;
Stop
Stops the currently running download process;
Clear
Clears the data associated with the currently selected Job Package.
2.3 Utilities
2.3.1 History
This facility makes it easier for the user to examine past statistics obtained by running Job
Packages. This section contains the full history of Job Package runs, and each of these runs may be
analyzed in detail by double-clicking it.
At this moment, there is no way of clearing this history.
Note: The Job Package information displayed here is the information gathered at the moment
when the Job Package was started (or ended), and it is not updated afterwards. This
means that if the name of a Job Package changes, the change will not be reflected in the
history statistics.
2.3.2 Regular Expressions Editor
This internal tool makes it easier for the user to check the regular expressions used in the
Job Package configuration process.
Regular expressions syntax
Examples
.*sometext.*
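To illustrate the example pattern above: `.*sometext.*` accepts any string that contains `sometext` anywhere. The sketch below checks this in Python, which agrees with Java's regular expression syntax for a pattern this simple. It assumes that the pattern must match the whole URL, which this manual does not state; the function name and sample URLs are made up for illustration.

```python
import re

# Pattern taken from the example above.
PATTERN = re.compile(r".*sometext.*")

def matches(url: str) -> bool:
    # fullmatch requires the pattern to cover the entire URL (assumed behavior).
    return PATTERN.fullmatch(url) is not None

# Hypothetical URLs, used only for illustration.
print(matches("http://www.example.com/sometext/page.html"))
print(matches("http://www.example.com/other/page.html"))
```

The first URL contains `sometext` and is accepted; the second does not and is rejected.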
Darcy Ripper
com
Darcy Ripper
be required.
This section covers the basic Job Package configuration parameters. Most of these
parameters are mandatory.
The available basic settings are:
Name
Defines the Job Package name. Multiple Job Packages can have the same name but we
do not recommend this approach because it will lead to confusion in organizing these
Job Packages. This is a mandatory field;
URL(s)
One or more URLs from which the Job Package will start its processing. The URL(s)
specified here must be valid (according to RFC 3986), otherwise Darcy will signal the
invalidity with an error. Multiple URLs can be added by pressing the "Add..." button. This
is a mandatory field;
Save Path
The absolute path of the directory where downloaded resources must be saved. For a
Job Package download process to work as expected, Darcy must have permission to
write files at this location. This is a mandatory field.
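The Save Path requirements above (an absolute path that Darcy can write to) can be checked in advance. This is a minimal sketch in Python, not Darcy Ripper code: `check_save_path` is a hypothetical helper, and it assumes the directory must already exist, which the manual does not say.

```python
import os
import tempfile

def check_save_path(path: str) -> None:
    """Raise an error if path cannot serve as a save location."""
    if not os.path.isabs(path):
        raise ValueError(f"not an absolute path: {path}")
    # Assumption: the directory must already exist.
    if not os.path.isdir(path):
        raise ValueError(f"not an existing directory: {path}")
    if not os.access(path, os.W_OK):
        raise PermissionError(f"no write permission: {path}")

# A freshly created temporary directory is absolute and writable.
demo_dir = tempfile.mkdtemp()
check_save_path(demo_dir)
print("usable save path:", demo_dir)
```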
Additional Settings
Download Entire Website
With this option enabled, the download process will not leave the website's
domain (e.g. it will not try to download Facebook pages if such links are found). We
recommend always keeping this option checked. It is checked by default.
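A minimal sketch of the kind of check this option implies, written in Python for illustration. It assumes that "the website's domain" means the host of the start URL plus its subdomains; how Darcy Ripper actually compares domains is not documented here. The function name and URLs are hypothetical.

```python
from urllib.parse import urlsplit

def same_site(start_url: str, link: str) -> bool:
    """Return True if link stays on the host of start_url (or a subdomain of it)."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    link_host = (urlsplit(link).hostname or "").lower()
    if not start_host or not link_host:
        return False
    return link_host == start_host or link_host.endswith("." + start_host)

start = "http://www.example.com/index.html"
print(same_site(start, "http://www.example.com/docs/a.html"))  # stays on the site
print(same_site(start, "https://www.facebook.com/somepage"))   # leaves the site
```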
Concurrency Settings
Parallel Downloads
The maximum number of downloads that can run in parallel at any given moment.
This is a mandatory field.
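The effect of a parallel download cap can be sketched with a fixed-size thread pool. The Python example below is an illustration under assumptions, not Darcy Ripper's implementation: `PARALLEL_DOWNLOADS`, `fake_download` and the URLs are hypothetical, and the "download" only sleeps while recording how many tasks run at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

PARALLEL_DOWNLOADS = 3  # hypothetical value of the "Parallel Downloads" setting

lock = threading.Lock()
stats = {"active": 0, "peak": 0}

def fake_download(url: str) -> str:
    """Stand-in for a real download; only records how many run at once."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.05)  # simulate transfer time
    with lock:
        stats["active"] -= 1
    return url

urls = [f"http://www.example.com/page{i}.html" for i in range(10)]
# The pool never runs more than PARALLEL_DOWNLOADS tasks at the same time.
with ThreadPoolExecutor(max_workers=PARALLEL_DOWNLOADS) as pool:
    finished = list(pool.map(fake_download, urls))

print(stats["peak"] <= PARALLEL_DOWNLOADS)
```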
Memory Settings
These settings help reduce memory usage, because the selected download information is not kept in the
application's memory.
Drop Ignored Links
If checked, all the links that have been ignored (rules not satisfied, limits imposed, etc.)
will be removed from memory and they will not be present in the overall results;
Drop Finished Links
If checked, all the links that have been downloaded and processed completely will be
removed from memory and they will not be present in the overall results.
3.1.1
This section allows setting up multiple URLs from which the Job Package will
start its processing.
All the other Job Package settings apply to every URL defined here, meaning that
different rules cannot be set for individual URLs. To achieve this,
multiple Job Packages must be defined.
Processing multiple Job Packages at the same time is not supported at this moment,
but we are working on it.
Each URL must be defined on a single line.
Note: The URLs list must not be empty.
Note: Each URL must be valid (according to RFC 3986), otherwise Darcy will signal the
invalidity with an error.
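The rules above (one URL per line, a non-empty list, every URL valid) can be sketched as follows. This Python example is only an illustration: `parse_url_list` is a hypothetical helper, and its validity check is a simplification that accepts only http and https URLs with a host, which is much looser than full RFC 3986 validation.

```python
from urllib.parse import urlsplit

def parse_url_list(text: str) -> list[str]:
    """Split a multi-line field into URLs, one per line, and validate them."""
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    if not urls:
        raise ValueError("the URL list must not be empty")
    for url in urls:
        parts = urlsplit(url)
        # Simplified check; a real RFC 3986 validation is stricter.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            raise ValueError(f"invalid URL: {url}")
    return urls

print(parse_url_list("http://www.example.com/\nhttps://docs.example.com/start.html\n"))
```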
User Password
The password used to authenticate with the proxy server.
Request Settings
Send Referral
Signals that the "Referer" request header must be added to the sent requests;
User Agent
Defines the "User-Agent" request header that must be added to the sent requests. By
default, Darcy Ripper uses its own value for this request header, but some Web
Servers require a specific user agent value to work as expected. Some examples are:
- Mozilla Firefox 20
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0
- IE 9
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64;
Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET
CLR 3.0.30729; Media Center PC 6.0)
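To show what these two request settings amount to on the wire, the sketch below builds a request object with a custom "User-Agent" header and an optional "Referer" header using Python's standard library. It does not send anything. `build_request` and the URLs are hypothetical, and this is not how Darcy Ripper itself is implemented.

```python
from urllib.request import Request

# Firefox 20 user agent string from the examples above.
FIREFOX_20 = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"

def build_request(url, user_agent, referer=None):
    """Create a request carrying the chosen User-Agent and, optionally, a Referer."""
    headers = {"User-Agent": user_agent}
    if referer is not None:
        headers["Referer"] = referer
    return Request(url, headers=headers)

req = build_request(
    "http://www.example.com/page.html",
    FIREFOX_20,
    referer="http://www.example.com/",
)
print(req.header_items())
```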
Bandwidth Limit
Bandwidth Limit
Defines the bandwidth that Darcy Ripper must not exceed during the Job Package
download process. This limitation applies to a single download thread.
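One common way to enforce a per-thread limit like this is to pause after each received chunk so that the average rate stays under the budget. The Python sketch below shows the arithmetic only. It assumes the limit is expressed in bytes per second, which the manual does not specify, and `Throttle` is a hypothetical class, not part of Darcy Ripper.

```python
import time

class Throttle:
    """Keeps one download thread at or below a bytes-per-second budget."""

    def __init__(self, limit_bps: float, started: float = None):
        self.limit_bps = limit_bps
        self.started = time.monotonic() if started is None else started
        self.received = 0

    def required_delay(self, chunk_size: int, now: float) -> float:
        """Seconds to wait after a chunk so the average rate stays within the limit."""
        self.received += chunk_size
        # Earliest moment at which this many bytes are allowed to have arrived.
        earliest = self.started + self.received / self.limit_bps
        return max(0.0, earliest - now)

# 500 bytes arrived 0.1 s after the start with a 1000 bytes/s limit:
# the thread should wait until 0.5 s have passed.
throttle = Throttle(limit_bps=1000, started=0.0)
print(throttle.required_delay(500, now=0.1))
```

In a real download loop, each thread would own one such object and call `time.sleep` with the returned value after every chunk.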
Even if the domain and path are not added to the header value, these fields are mandatory so
that Darcy can decide which cookies must be sent to which host.
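The role of the domain and path fields can be sketched as follows: they select which cookies apply to a URL, while only the name and value end up in the "Cookie" header. This Python example is a simplified illustration loosely modelled on common browser matching rules. The helper names and cookie values are hypothetical, and Darcy Ripper's exact matching rules are not described in this manual.

```python
from urllib.parse import urlsplit

def cookie_applies(cookie_domain: str, cookie_path: str, url: str) -> bool:
    """Decide whether a cookie with this domain and path belongs to the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    domain = cookie_domain.lower().lstrip(".")
    path = parts.path or "/"
    domain_ok = host == domain or host.endswith("." + domain)
    # The cookie path must be a prefix of the URL path on a "/" boundary.
    path_ok = path == cookie_path or (
        path.startswith(cookie_path)
        and (cookie_path.endswith("/") or path[len(cookie_path)] == "/")
    )
    return domain_ok and path_ok

def cookie_header(cookies, url: str) -> str:
    """Build the Cookie header value from (name, value, domain, path) tuples."""
    return "; ".join(
        f"{name}={value}"
        for name, value, domain, path in cookies
        if cookie_applies(domain, path, url)
    )

cookies = [
    ("sid", "abc", "example.com", "/"),
    ("pref", "x", "other.org", "/"),
]
print(cookie_header(cookies, "http://www.example.com/a/b"))
```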
Defines the action that must be taken when all the conditions are met. At this moment
there are three possible actions:
- Ok: The request is considered finished with success;
- Retry: A retry request must be issued;
- Failed: The request is considered finished with error.
Delay
The delay period (in milliseconds) that must pass before the retry request is
issued.
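A rule of this kind can be pictured as a condition paired with an action and a delay. The Python sketch below assumes that the condition is a range of HTTP status codes and that the first matching rule wins; both are assumptions, since the rule conditions are not shown in this part of the manual. The class names, the default action and the sample rules are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    OK = "Ok"
    RETRY = "Retry"
    FAILED = "Failed"

@dataclass
class Rule:
    first_code: int   # lowest status code the rule applies to (assumed condition)
    last_code: int    # highest status code the rule applies to
    action: Action
    delay_ms: int = 0  # only meaningful for Retry

def decide(rules, status_code, default=Action.OK):
    """Return (action, delay_ms) of the first rule matching the status code."""
    for rule in rules:
        if rule.first_code <= status_code <= rule.last_code:
            return rule.action, rule.delay_ms
    return default, 0

rules = [
    Rule(200, 299, Action.OK),
    Rule(503, 503, Action.RETRY, delay_ms=2000),
    Rule(400, 599, Action.FAILED),
]
for code in (200, 404, 503):
    print(code, decide(rules, code))
```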
The HTTP response status codes referred to above are the filtering criteria for these
settings. The HTTP standard defines rules that Web Servers must respect with
regard to these codes. Some of these specifications and status codes are described next:
1XX Informational
The codes of this class refer to information messages (header settings or process progress) and
must not be sent to HTTP/1.0 clients;
2XX Success
The codes of this class signal to the client that the server received the request, understood it and
successfully processed it. Some of the important codes of this class are 200 OK, 204 No
Content and 206 Partial Content;
3XX Redirection
The codes of this class inform the client that at least one more request must be performed in
order for it to receive the requested resource. The most encountered status code of this class is
301 Moved Permanently;
4XX Client Error
The codes of this class signal to the client that an error was encountered and that the error
originated on the client side. Some of the most encountered codes of this class are 400 Bad Request, 401
Unauthorized, 403 Forbidden and 408 Request Timeout;
5XX Server Error
The codes of this class signal to the client that an error was encountered and that the error
originated on the server side. Some of the most encountered codes of this class are 500 Internal Server
Error, 502 Bad Gateway and 503 Service Unavailable.
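Because the class of a status code is given by its first digit, the classes listed above can be derived with an integer division by 100. The short Python sketch below shows this, using the standard library's `HTTPStatus` for the reason phrases; `status_class` is a hypothetical helper written for this illustration.

```python
from http import HTTPStatus

CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code: int) -> str:
    """Map a status code to its class name using its first digit."""
    return CLASSES.get(code // 100, "Unknown")

for code in (200, 301, 403, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```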
For more details about the HTTP specifications regarding the Response Status Codes,
you can further read:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=40132
Links Limit
Defines the maximum number of links that may be followed during the Job Package
download process. When this limit is reached, the download process will stop;
Time Limit
Defines the maximum time (in milliseconds) that a Job Package download process may
take. When this time limit is reached, the download process will stop.
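The two limits above can be pictured as a guard that is consulted before each link is followed. This Python sketch is an illustration under assumptions: the `Limits` class is hypothetical, and the injectable clock exists only so that the behavior can be shown without waiting.

```python
import time

class Limits:
    """Stops a crawl once a link count or a time budget is used up."""

    def __init__(self, links_limit: int, time_limit_ms: float, clock=time.monotonic):
        self.links_limit = links_limit
        self.time_limit_ms = time_limit_ms
        self.clock = clock
        self.started = clock()
        self.followed = 0

    def allow_next(self) -> bool:
        """Return True and count the link if both limits still permit it."""
        elapsed_ms = (self.clock() - self.started) * 1000
        if self.followed >= self.links_limit or elapsed_ms >= self.time_limit_ms:
            return False
        self.followed += 1
        return True

# With a links limit of 2, the third link is refused.
limits = Limits(links_limit=2, time_limit_ms=60_000)
print([limits.allow_next() for _ in range(3)])
```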
File Size Filter
These settings allow decisions to be made depending on the file size of web resources.
For example, they may be used to avoid downloading large files.
The available file size filter properties are:
File Size (from)
Defines the start value of the file size interval, from which files will be considered by this
filter;
File Size (to)
Defines the end value of the file size interval, up to which files will be considered by this
filter;
Reply 'Content-Length' not available action
Defines the actual action that must be taken for the file whose size is between File Size
(from) and File Size (to). At this moment the two possible values are:
- Save To Disk: saves the file to disk;
- Reject File: rejects that particular file and will not download it.
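A minimal sketch of how such a filter could reach a decision, written in Python for illustration. It rests on several assumptions that the manual does not confirm: sizes are in bytes, the interval is inclusive, files outside the interval are saved, and a separate action applies when the server reports no "Content-Length". The function and parameter names are hypothetical.

```python
SAVE = "Save To Disk"
REJECT = "Reject File"

def file_size_decision(content_length, size_from, size_to,
                       in_range_action=REJECT, unknown_size_action=SAVE):
    """Pick an action from the reported Content-Length (None if not available)."""
    if content_length is None:
        # The server did not report a size.
        return unknown_size_action
    if size_from <= content_length <= size_to:
        return in_range_action
    # Assumption: files outside the interval are downloaded normally.
    return SAVE

# Reject anything between 10 MB and 1 GB, keep smaller files.
print(file_size_decision(50_000_000, 10_000_000, 1_000_000_000))
print(file_size_decision(20_000, 10_000_000, 1_000_000_000))
print(file_size_decision(None, 10_000_000, 1_000_000_000))
```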
URL Prefix Filter
URL Prefix Filter
Defines yet another method of limiting the followed links: only URLs which begin
with this value will be followed.
Three possible actions can be defined for a filter:
No Change
No action will be taken. This represents the default action;
Accept
The request is considered valid and will be processed;
Reject
The request is considered invalid and will not be processed.