(Based on Fred Cohen's guide. Updated by Jargoon, edited by Leto.)
Filters give you control over which files HTTrack will or will not download. You can both expand and restrict access to the websites you are trying to mirror.
Whenever you mirror a website, HTTrack tries to download everything inside the starting directory and any of its sub-directories. By default it ignores anything outside that domain. If you need different behaviour, try filters: they let you add parts of other websites, deny certain sub-directories of the current website, or fetch only certain kinds of files.
Important: Filters help control HTTrack, but they only apply to pages and files that it discovers while crawling the websites defined in your Start URL settings.
To include something, use a plus sign (+). To exclude anything you don't need, use a minus sign (-). Asterisks (*) work as wildcards and match any number of characters.
+www.all.net/i_want_this/* -www.all.net/i_dont_want_that/* +*.edu.au/*.jpg
A list of filters works from least important to most important (later filters take precedence over earlier ones).
In the following example, even though a restriction has been added with the minus filter, all GIF files found will be downloaded because the second filter is overriding the first filter (more specifically, the first filter is not even applied due to the second filter):
However, in this example the two filters work together nicely. The first filter initially allows GIF files from any domain/server, and the second filter then denies any GIF files that are inside an "images" directory. Another way to think of what these filters are doing is: if a GIF file (on any domain) is not in an "images" directory, then permit it to be downloaded.
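The precedence rule described above can be sketched in a few lines of Python. This is only an illustrative model, not HTTrack's own matching code: the function names, the `default` parameter (standing in for HTTrack's own scope decision) and the two filter lists are assumptions made up for the example, and only the plain `*` wildcard is handled.

```python
import re

def pattern_to_regex(pattern):
    """Turn a filter pattern into a regex where '*' matches any run of characters."""
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile("^" + ".*".join(parts) + "$")

def is_allowed(url, filters, default=False):
    """Return the verdict of the last filter that matches url (later filters win)."""
    verdict = default
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if pattern_to_regex(pattern).match(url):
            verdict = (sign == "+")
    return verdict

# Hypothetical filter lists, chosen for illustration only.
overridden = ["-*/images/*.gif", "+*.gif"]   # the later, broader rule wins
cooperating = ["+*.gif", "-*/images/*.gif"]  # the later, narrower rule refines

print(is_allowed("www.all.net/images/logo.gif", overridden))   # True
print(is_allowed("www.all.net/images/logo.gif", cooperating))  # False
print(is_allowed("www.all.net/docs/chart.gif", cooperating))   # True
```

Putting the broad rule first and the narrow exception last is what makes the second list behave as intended.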
These filters will control certain files depending on their extension (e.g. zip, gif, tgz, pdf, mpeg).
|Will download everything inside the website, wherever they are found|
|Will download everything inside any website with a .com as top level domain|
|Will not download any file inside a folder called "ads"|
|Will download just "jpeg" files inside a folder called "images" (note "jpg" inside that folder will not be downloaded at all)|
|Collection of filters that will hopefully download every image (at least the most common ones) inside a folder called "images"|
|This is a good way to exclude all pages, because without "html" files there are no links, and therefore there are no page downloads|
|On the other hand, adding this filter will capture every "html" file of almost every website on the web|
|Similar to the previous one, but also excluding any "html" file with extra characters after the extension (such as in dynamic links with parameters, as in |
|Disallow every page and file. Very useful as the first filter, and to then build upon with additional filters|
After a file has been added to the download queue, you still have an opportunity to abort the actual download. You can control the size range of your downloaded files. But beware: this only helps if the server sends the correct file size before the download starts. If it does not, the file will be downloaded and deleted afterwards.
|Any file will be rejected if its size is smaller than 10KB|
|Any file exceeding 50KB will be rejected|
|Three ways of doing the same thing: every file smaller than 10KB or greater than 50KB will be rejected|
|Every gif file greater than 15KB will be rejected (useful for thumbnails?)|
|Will only accept gif, jpg, jpeg and png images with file sizes ranging from 500KB to 1000KB|
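The size check and its caveat can be modelled as follows. This is a hedged sketch of the behaviour described above, not HTTrack's implementation: `size_in_range` and `decide` are hypothetical helpers, sizes are in KB, and `announced_kb=None` stands for a server that does not report the size in advance.

```python
def size_in_range(size_kb, min_kb=None, max_kb=None):
    """True if size_kb lies within the optional lower and upper bounds."""
    if min_kb is not None and size_kb < min_kb:
        return False
    if max_kb is not None and size_kb > max_kb:
        return False
    return True

def decide(announced_kb, actual_kb, min_kb=None, max_kb=None):
    """Describe what happens to a queued file under a size filter."""
    # If the server announces a size, the file can be rejected up front.
    if announced_kb is not None and not size_in_range(announced_kb, min_kb, max_kb):
        return "skipped before download"
    # Otherwise the file is fetched and its real size is checked afterwards.
    if size_in_range(actual_kb, min_kb, max_kb):
        return "kept"
    return "downloaded then deleted"

# Accept only files between 10KB and 50KB.
print(decide(5, 5, min_kb=10, max_kb=50))     # skipped before download
print(decide(None, 5, min_kb=10, max_kb=50))  # downloaded then deleted
print(decide(None, 30, min_kb=10, max_kb=50)) # kept
```

The second case is the wasteful one: the bandwidth is spent before the filter can act.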
MIME type filters are important for dynamically generated websites (pages created with PHP, ASP, CGI, etc.): the file extension filters rely on the assumption that HTTrack knows the type of file it is downloading. That's not the case with dynamic pages, as the file may be downloaded before its file type is known. The MIME type sent by the server (when it sends one) will do the trick.
Important: MIME type checking is always done on links that have already been added to the download queue (the files that have passed the other filters), so rules such as -mime:*/* +mime:text/html +mime:image/gif will only refine the results of the other filters.
|Will cancel any file of type application that is queued|
|Will cancel any file of type/subtype "application/pdf" that is queued|
|Opposite of the previous filter: will cancel any queued file of type application except pdf files|
|Cancel any file that is not an html or image file. Beware: although this filter does the job, it is not the most efficient one, because files are cancelled only after all queries have been sent and answers received, causing useless load on the server. It's better to add other rules (based on file extensions) and keep these ones as a fallback.|
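As a rough illustration of MIME rules acting as a second pass over already-queued files, here is a small Python model. It is a sketch under stated assumptions: `mime_allowed` is a hypothetical helper, the queued file names and their types are invented, and the rules are assumed to follow the same "later rule wins" order as other filters. Only the rule string itself comes from the text above.

```python
from fnmatch import fnmatchcase

PREFIX = "mime:"

def mime_allowed(mime_type, rules):
    """Apply +mime:/-mime: rules in order; the last matching rule decides."""
    verdict = True  # the file already passed the other filters
    for rule in rules:
        sign, body = rule[0], rule[1:]
        if not body.startswith(PREFIX):
            continue  # not a MIME rule, ignored by this sketch
        pattern = body[len(PREFIX):]
        if fnmatchcase(mime_type, pattern):
            verdict = (sign == "+")
    return verdict

rules = "-mime:*/* +mime:text/html +mime:image/gif".split()

# Hypothetical queue: file name -> MIME type reported by the server.
queued = {
    "index.php": "text/html",
    "logo.gif": "image/gif",
    "manual.pdf": "application/pdf",
}

kept = [name for name, mime in queued.items() if mime_allowed(mime, rules)]
print(kept)  # ['index.php', 'logo.gif']
```

Note that every file in `queued` has already cost a request to the server, which is why extension-based rules are the better first line of defence.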
|Common MIME types|
|Data content description||Typical extensions||MIME type/subtype|
|HTML text||html htm||text/html|
|Cascading Style Sheet||css||text/css|
|Image: JPEG||jpeg jpg jpe||image/jpeg|
|Image: Microsoft Bitmap||bmp||image/x-ms-bmp|
|Image: Portable Network Graphics||png||image/x-png|
|Image: TIFF||tiff tif||image/tiff|
|Sound file: Microsoft||wav||audio/x-wav|
|Sound file: MPEG||mpa abs mpega||audio/x-mpeg|
|Sound file: MPEG-2||mp2a mpa2||audio/x-mpeg-2|
|Sound file: Realaudio (Progressive Networks)||ra ram||application/x-pn-realaudio|
|Sound file: MIDI||mmid||x-music/x-midi|
|Video: MPEG||mpeg mpg mpe||video/mpeg|
|Video: MPEG-2||mpv2 mp2v||video/mpeg-2|
|Video: Macintosh Quicktime||qt mov||video/quicktime|
|PostScript||ai eps ps||application/postscript|
|Microsoft Rich Text Format||rtf||application/rtf|
|Adobe Acrobat (PDF)||pdf||application/pdf|
|Compressed file: Gnu tar||gtar||application/x-gtar|
|Compressed file: BSD4.3 tar||tar||application/x-tar|
|Compressed file: Zip||zip||application/zip|
|Microsoft PowerPoint presentation||ppz||application/mspowerpoint|
|Binary, UUencoded||bin uu||application/octet-stream|
|Undefined binary data (usually executable programs)||application/octet-stream|