Filters

(Based on Fred Cohen's guide. Updated by Jargoon, edited by Leto.)

Filters give you control over which files HTTrack will or will not download. You can use them both to expand and to restrict access to the websites you are trying to mirror.

Whenever you make a mirror of a website, HTTrack tries to download everything inside the starting directory and in any associated sub-directories. It ignores anything outside that domain by default. If you need different behaviour, try filters: they let you add parts of other websites, deny certain sub-directories of the current website, and download only certain kinds of files.

Let's start by learning three important terms:

Syntax

To include certain kinds of things, use a plus sign (+). To exclude anything you don't need, use a minus sign (-). Asterisks (*) work as wildcards that match any number of characters.

Example:

+www.all.net/i_want_this/*
-www.all.net/i_dont_want_that/*
+*.edu.au/*.jpg
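
To get a feel for what the asterisk matches, the rules above can be tried against a few addresses. The following is only a rough sketch: it uses Python's fnmatch module as a stand-in for HTTrack's own matcher (which is not identical), writes addresses without "http://" as in the examples above, and the helper names split_rule and rule_matches are made up for this illustration.

```python
from fnmatch import fnmatchcase


def split_rule(rule):
    """Split '+pattern' or '-pattern' into (is_include, pattern)."""
    sign, pattern = rule[0], rule[1:]
    if sign not in ("+", "-"):
        raise ValueError("a filter must start with + or -: " + rule)
    return sign == "+", pattern


def rule_matches(rule, url):
    """Return True if the pattern part of the rule matches the address.

    Assumption: fnmatch-style matching, where * matches any run of
    characters (including '/'), approximates HTTrack's wildcard.
    """
    _, pattern = split_rule(rule)
    return fnmatchcase(url, pattern)


if __name__ == "__main__":
    for url in ("www.all.net/i_want_this/page.html",
                "www.unimelb.edu.au/pics/logo.jpg",
                "www.all.net/logo.jpg"):
        print(url, rule_matches("+*.edu.au/*.jpg", url))
```

Here only the second address matches +*.edu.au/*.jpg, because it is the only one that contains ".edu.au/" and ends in ".jpg".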

Priorities

A list of filters works from least important to most important (later filters take precedence over earlier ones).

In the following example, even though a restriction has been added with the minus filter, all GIF files found will be downloaded, because the second filter overrides the first (more specifically, the first filter is never applied, since every file it matches is also matched by the second filter):

-*/images/*.gif
+*.gif

However, in this example the two filters work together nicely. The first filter allows GIF files from any domain/server, and the second filter then narrows that by denying any GIF files that are inside an "images" directory. Another way to think of what these filters do: if a GIF file (on any domain) is not in an "images" directory, permit it to be downloaded.

+*.gif
-*/images/*.gif
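
The priority rule in the two examples above can be sketched as "the last filter that matches decides". This is a simplified model, not HTTrack's actual code: it assumes fnmatch-style wildcards, and the function name decide and its default parameter (standing in for whatever HTTrack would do when no filter matches) are invented for this illustration.

```python
from fnmatch import fnmatchcase


def decide(filters, url, default=False):
    """Return True if the address would be downloaded under this model.

    Filters are read in order; each matching filter overwrites the
    verdict, so the last matching filter takes precedence.
    """
    verdict = default  # assumption: stands in for HTTrack's default behaviour
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if fnmatchcase(url, pattern):
            verdict = (sign == "+")
    return verdict


if __name__ == "__main__":
    overridden = ["-*/images/*.gif", "+*.gif"]
    cooperating = ["+*.gif", "-*/images/*.gif"]
    for url in ("www.all.net/images/logo.gif", "www.all.net/pics/logo.gif"):
        print(url, decide(overridden, url), decide(cooperating, url))
```

With the first ordering both addresses are accepted; with the second ordering the file under "images" is rejected and the other one is still accepted.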

Resources