Filters


(Based on Fred Cohen's guide. Updated by Jargoon, edited by Leto.)

Introduction

Filters give you control over which files HTTrack will or will not download. You can both expand and restrict access to the websites you are trying to mirror.

Whenever you mirror a website, HTTrack tries to download everything inside the starting directory and any of its sub-directories. It ignores anything outside that domain by default. If you need different behaviour, try filters: they let you add parts of other websites, exclude certain sub-directories of the current website, or fetch only certain kinds of files.

Important: Filters only help control HTTrack. They apply solely to pages and files that it discovers while crawling the websites defined in your Start URL settings.

Syntax

To include something, use a plus sign (+). To exclude something you don't need, use a minus sign (-). An asterisk (*) works as a wildcard that matches any number of characters.

Example:

+www.all.net/i_want_this/*
-www.all.net/i_dont_want_that/*
+*.edu.au/*.jpg

Priorities

A list of filters is read from least important to most important: later filters take precedence over earlier ones.

In the following example, even though a restriction has been added with the minus filter, all GIF files found will be downloaded because the second filter is overriding the first filter (more specifically, the first filter is not even applied due to the second filter):

-*/images/*.gif
+*.gif

However, in this example the two filters work together. The first filter allows GIF files from any domain/server, and the second then denies any GIF files that are inside an "images" directory. Another way to think of what these filters are doing: if a GIF file (on any domain) is not in an "images" directory, it may be downloaded.

+*.gif
-*/images/*.gif
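
The ordering rule described here can be modelled in a few lines of Python. This is only a simplified sketch, not HTTrack's actual implementation: the helper names wildcard_to_regex and is_allowed are made up for illustration, only the * wildcard is handled, and treating a URL that matches no filter as rejected is an assumption.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a filter pattern into a regex where '*' matches any run of characters."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts) + r"\Z")


def is_allowed(filters, url, default=False):
    """Scan the filters in order; the last one that matches decides."""
    verdict = default  # assumption: used when no filter matches
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if wildcard_to_regex(pattern).match(url):
            verdict = (sign == "+")
    return verdict


url = "www.example.com/images/logo.gif"
print(is_allowed(["-*/images/*.gif", "+*.gif"], url))  # True: "+*.gif" comes last
print(is_allowed(["+*.gif", "-*/images/*.gif"], url))  # False: the "images" rule comes last
```

Because every filter is checked and the last match wins, swapping the two rules is enough to change the result for the same URL.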

Examples of filter types

File extension filters

These filters will control certain files depending on their extension (e.g. zip, gif, tgz, pdf, mpeg).

+www.example.com/*
Will download everything inside the website, wherever it is found

+*.com/*
Will download everything inside any website with .com as its top-level domain

-www.example.com/ads/*
Will not download any file inside a folder called "ads"

+www.example.com/images/*.jpeg
Will download only "jpeg" files inside a folder called "images" (note that "jpg" files inside that folder will not be downloaded at all)

+www.example.com/images/*.jpeg
+www.example.com/images/*.jpg
+www.example.com/images/*.gif
+www.example.com/images/*.png
+www.example.com/images/*.bmp
Collection of filters that should download every image (at least the most common types) inside a folder called "images"

-*.html
This is a good way to exclude all pages, because without "html" files there are no links, and therefore there are no page downloads

+*.html
On the other hand, adding this filter can lead to the capture of every "html" file of almost every website on the web

+*.html*[]
Similar to the previous one, but excludes any "html" file with extra characters after the extension (such as dynamic links with parameters, as in www.example.com/index.html?page=10)

-*
Disallows every page and file. Very useful as the first filter, to then build upon with additional filters

File size filters

After a file has been added to the download queue, you still have an opportunity to abort the actual download. You can control the size range of your downloaded files. But beware: this is only helpful if the server sends the correct file size before the download starts. If it does not, the file will be downloaded and then deleted.

-*[<10]
Any file smaller than 10KB will be rejected

-*[>50]
Any file exceeding 50KB will be rejected

-*[<10] -*[>50]
-*[<10]*[>50]
-*[<10>50]
Three ways to do the same thing: every file smaller than 10KB or greater than 50KB will be rejected

-*.gif*[>15]
Every gif file greater than 15KB will be rejected (useful for thumbnails?)

+*.gif -*.gif*[<500>1000]
+*.jpg -*.jpg*[<500>1000]
+*.jpeg -*.jpeg*[<500>1000]
+*.png -*.png*[<500>1000]
Will only accept gif, jpg, jpeg and png images with file sizes ranging from 500KB to 1000KB
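
One way to read the bracketed size conditions is as a list of bounds in KB, where the condition holds as soon as any bound is crossed. The Python sketch below models that reading. The function name size_condition_matches is hypothetical, and this is not how HTTrack itself parses filters.

```python
import re


def size_condition_matches(condition, size_kb):
    """Return True if size_kb is below a '<N' bound or above a '>N' bound."""
    bounds = re.findall(r"([<>])(\d+)", condition)
    return any(
        size_kb < int(limit) if op == "<" else size_kb > int(limit)
        for op, limit in bounds
    )


# With "-*[<10>50]", a file is rejected whenever the condition holds.
for size in (5, 30, 60):
    rejected = size_condition_matches("<10>50", size)
    print(size, "KB:", "rejected" if rejected else "kept")
```

Under this reading, a rule such as -*.gif*[<500>1000] leaves only gif files between 500KB and 1000KB.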

MIME filters

These are important for dynamically generated websites (pages created with PHP, ASP, CGI, etc.): file extension filters rely on the assumption that HTTrack knows the type of file it is downloading. That is not the case with dynamic pages, as a file may be downloaded before its type is known. The MIME type of the file, if the server sends it correctly, will do the trick.

Important: MIME type checking is always done on links already added to the download queue (the files that have passed the other filters), so rules such as -mime:*/* +mime:text/html +mime:image/gif will only refine the results of the other filters.

Examples

-mime:application/*
Will cancel any queued file of type "application"

-mime:application/pdf
Will cancel any queued file of type/subtype "application/pdf"

-mime:application/* +mime:application/pdf
The opposite of the previous filter: will cancel any queued file of the application type except pdf files

-mime:*/* +mime:text/html +mime:image/*
Cancels any file that is not an html or image file. Beware: although this filter does the job, it is not the most efficient one, because files are cancelled only after all requests have been sent and answers received, putting useless load on the server. It is better to add other rules (based on file extensions) and perhaps keep these only as a safety net.
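
The MIME examples can be modelled in the same spirit as URL filters, with the pattern compared against the type/subtype reported by the server. The sketch below is a simplified Python model under stated assumptions: mime_allowed is a hypothetical name, later rules are assumed to override earlier ones as the examples suggest, and a type that matches no rule is assumed to be kept because the file has already passed the other filters.

```python
import re


def mime_allowed(rules, content_type, default=True):
    """Apply '+mime:' / '-mime:' rules in order; the last match decides."""
    verdict = default  # assumption: keep the file when no rule matches
    for rule in rules:
        sign, pattern = rule[0], rule.split("mime:", 1)[1]
        # '*' matches any run of characters; everything else is literal.
        regex = ".*".join(re.escape(p) for p in pattern.split("*")) + r"\Z"
        if re.match(regex, content_type):
            verdict = (sign == "+")
    return verdict


rules = ["-mime:*/*", "+mime:text/html", "+mime:image/*"]
for content_type in ("text/html", "image/png", "application/pdf"):
    print(content_type, mime_allowed(rules, content_type))
```

In this model the decision can only be made once the server has answered with a content type, which is why the text recommends extension-based rules as the first line of defence.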

Common MIME types

Data content description | Typical extensions | MIME type/subtype
HTML text | html htm | text/html
Plain text | txt | text/plain
SGML document | sgml | text/sgml
Cascading Style Sheet | css | text/css
Image: GIF | gif | image/gif
Image: JPEG | jpeg jpg jpe | image/jpeg
Image: Microsoft Bitmap | bmp | image/x-ms-bmp
Image: Portable Network Graphics | png | image/x-png
Image: TIFF | tiff tif | image/tiff
Sound file: Microsoft | wav | audio/x-wav
Sound file: MPEG | mpa abs mpega | audio/x-mpeg
Sound file: MPEG-2 | mp2a mpa2 | audio/x-mpeg-2
Sound file: Realaudio (Progressive Networks) | ra ram | application/x-pn-realaudio
Sound file: MIDI | mmid | x-music/x-midi
Video: MPEG | mpeg mpg mpe | video/mpeg
Video: MPEG-2 | mpv2 mp2v | video/mpeg-2
Video: Macintosh Quicktime | qt mov | video/quicktime
Video: Microsoft | avi | video/x-msvideo
PostScript | ai eps ps | application/postscript
Microsoft Rich Text Format | rtf | application/rtf
Adobe Acrobat (PDF) | pdf | application/pdf, application/x-pdf
Java file | jar | application/java-archive
Java class | class | application/java-vm
Compressed file: Gnu tar | gtar | application/x-gtar
Compressed file: BSD4.3 tar | tar | application/x-tar
Compressed file: Zip | zip | application/zip
Javascript program | js ls mocha | text/javascript, application/x-javascript
VBScript program | | text/vbscript
Perl program | pl | application/x-perl
Macromedia Shockwave | | application/x-director
Microsoft PowerPoint presentation | ppz | application/mspowerpoint
Microsoft PowerPoint presentation | ppt | application/vnd.ms-powerpoint
Binary, UUencoded | bin uu | application/octet-stream
PC executable | exe | application/octet-stream
Undefined binary data (usually executable programs) | | application/octet-stream

Resources