HTTrack Help: Filters

Introduction
Syntax
Priorities
Examples of filter types
Resources

(Based on Fred Cohen's guide. Updated by Jargoon, edited by Leto.)

Introduction

Filters give you control over which files HTTrack will or will not download. You can use them both to expand and to restrict the parts of the websites you are trying to mirror.

Whenever you mirror a website, HTTrack tries to download everything inside the starting directory and its sub-directories. By default it ignores anything outside that domain. If you need different behaviour, try filters: they let you add parts of other websites, exclude certain sub-directories of the current website, or download only certain kinds of files.

Important: Filters help control HTTrack, but they only apply to the pages and files that it discovers while crawling the websites defined in your Start URL settings.

Syntax

To include something, use a plus sign (+). To exclude anything you don't need, use a minus sign (-). An asterisk (*) works as a wildcard that matches any number of characters.

Example:

+www.all.net/i_want_this/*
-www.all.net/i_dont_want_that/*
+*.edu.au/*.jpg

Priorities

A list of filters works from least important to most important: later filters take precedence over earlier ones.

In the following example, even though a restriction has been added with the minus filter, all GIF files found will be downloaded because the second filter is overriding the first filter (more specifically, the first filter is not even applied due to the second filter):

-*/images/*.gif
+*.gif

However, in this example the two filters work together nicely. The first filter initially allows GIF files from any domain/server, and the second filter then denies any GIF files that are inside an "images" directory. Another way to think of what these filters do: if a GIF file (on any domain) is not in an "images" directory, it may be downloaded.

+*.gif
-*/images/*.gif
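
The ordering rule can be modelled in a few lines of Python. This is a simplified sketch, not HTTrack's actual matching code: it assumes links are written without the scheme (as in the filters on this page), treats `*` as the only wildcard, and uses a hypothetical `is_allowed` helper whose `default` argument stands in for whatever HTTrack would decide when no filter matches.

```python
import re


def rule_regex(pattern):
    # Escape the pattern, then let each '*' match any run of characters.
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


def is_allowed(link, filters, default=False):
    """Return the verdict of the last filter that matches the link."""
    verdict = default
    for rule in filters:
        sign, pattern = rule[0], rule[1:]
        if rule_regex(pattern).fullmatch(link):
            verdict = (sign == "+")  # later rules overwrite earlier ones
    return verdict


link = "www.example.com/images/logo.gif"
print(is_allowed(link, ["-*/images/*.gif", "+*.gif"]))  # True: '+*.gif' comes last
print(is_allowed(link, ["+*.gif", "-*/images/*.gif"]))  # False: the exclusion comes last
```

With the second ordering, a GIF outside any "images" directory (for example `www.example.com/icons/logo.gif`) is still allowed, because only the first filter matches it.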

Examples of filter types

File extension filters

These filters will control certain files depending on their extension (e.g. zip, gif, tgz, pdf, mpeg).

`+www.example.com/*`	Will download every file inside the website, wherever it is found
`+*.com/*`	Will download everything inside any website with .com as its top-level domain
`-www.example.com/ads/*`	Will not download any file inside a folder called "ads"
`+www.example.com/images/*.jpeg`	Will download just "jpeg" files inside a folder called "images" (note that "jpg" files inside that folder will not be downloaded at all)
`+www.example.com/images/*.jpeg` `+www.example.com/images/*.jpg` `+www.example.com/images/*.gif` `+www.example.com/images/*.png` `+www.example.com/images/*.bmp`	Collection of filters that should download every image (at least the most common types) inside a folder called "images"
`-*.html`	This is a good way to exclude all pages, because without "html" files there are no links, and therefore there are no page downloads
`+*.html`	On the other hand, adding this filter leads to the capture of every "html" file on almost every website on the web
`+*.html*[]`	Similar to the previous one, but excluding any "html" file with extra characters after the extension (such as dynamic links with parameters, as in `www.example.com/index.html?page=10`)
`-*`	Disallows every page and file. Very useful as the first filter, to then build upon with additional filters
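
Long per-extension lists like the image collection above are easy to generate. The following is a minimal sketch that only builds the filter strings. The `extension_filters` helper is hypothetical, and the `+www.example.com/*.html` rule is added purely for illustration so that pages, and therefore links, remain reachable after the leading `-*`.

```python
def extension_filters(folder, extensions):
    """Return one '+folder/*.ext' filter per extension."""
    return [f"+{folder}/*.{ext}" for ext in extensions]


# Start by refusing everything, then add back what is wanted.
filters = ["-*", "+www.example.com/*.html"]
filters += extension_filters(
    "www.example.com/images", ["jpeg", "jpg", "gif", "png", "bmp"]
)

for rule in filters:
    print(rule)
```

The order of the resulting list matters: `-*` must stay first so that the later `+` filters can override it.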

File size filters

After a file has been added to the download queue, you still have an opportunity to abort the actual download by restricting the size range of downloaded files. But beware: this only helps if the server sends the correct file size before the download starts. If it does not, the file will be downloaded and deleted afterwards.

`-*[<10]`	Any file will be rejected if its size is smaller than 10KB
`-*[>50]`	Any file exceeding 50KB will be rejected
`-*[<10] -*[>50]` `-*[<10]*[>50]` `-*[<10>50]`	Three ways to do the same thing: every file smaller than 10KB or larger than 50KB will be rejected
`-*.gif*[>15]`	Every gif file larger than 15KB will be rejected (possibly useful for keeping only thumbnails)
`+*.gif -*.gif*[<500>1000]` `+*.jpg -*.jpg*[<500>1000]` `+*.jpeg -*.jpeg*[<500>1000]` `+*.png -*.png*[<500>1000]`	Will only accept gif, jpg, jpeg and png images with file sizes ranging from 500KB to 1000KB
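
The size tests in the table can be read as "reject when any test inside the brackets is true". Below is a rough Python model of that reading. It assumes sizes are given in KB, as in the descriptions above, and the `size_rule_matches` helper is hypothetical. HTTrack's own parsing may differ in its details.

```python
import re

# One size test: '<' or '>' followed by a number of KB.
SIZE_TEST = re.compile(r"([<>])(\d+)")


def size_rule_matches(bracket, size_kb):
    """True when size_kb satisfies any test in a bracket such as '<10>50'."""
    for op, limit in SIZE_TEST.findall(bracket):
        if op == "<" and size_kb < int(limit):
            return True
        if op == ">" and size_kb > int(limit):
            return True
    return False


# Model of '-*[<10>50]': files under 10KB or over 50KB are rejected.
for size in (5, 30, 80):
    rejected = size_rule_matches("<10>50", size)
    print(size, "KB:", "rejected" if rejected else "kept")
```

Under this model only the 30KB file is kept, which matches the "smaller than 10KB or greater than 50KB" description.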

MIME filters

These are important for dynamically generated websites (pages created with PHP, ASP, CGI, etc.). File extension filters rely on the assumption that HTTrack knows the type of the file it is downloading. That is not the case with dynamic pages, where the file may be downloaded before its file type is known. In that case the MIME type sent by the server, when it is sent correctly, does the trick.

Important: MIME type checking is only done on links already added to the download queue (the files that have passed the other filters), so rules such as `-mime:*/* +mime:text/html +mime:image/gif` can only refine the results of other filters.

Examples
`-mime:application/*`	Will cancel any file of type application that is queued
`-mime:application/pdf`	Will cancel any file of type/subtype "application/pdf" that is queued
`-mime:application/*` `+mime:application/pdf`	The opposite of the previous filter: will cancel any queued file of type application, except pdf files
`-mime:*/*` `+mime:text/html` `+mime:image/*`	Cancels any file that is not an html or image file. Beware: although this filter does the job, it is not the most efficient one, because files are cancelled only after all requests have been sent and all answers received, causing useless load on the server. It is better to add other rules (based on file extensions) and keep these ones as a fallback.
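
As a rough illustration of the last example, the sketch below applies MIME rules to a server-reported content type, using the same "later rule wins" ordering described under Priorities. It is a simplified model under stated assumptions: the `mime_allowed` helper is hypothetical, a queued link is assumed to be kept unless a rule cancels it, and Python's `fnmatchcase` stands in for HTTrack's wildcard matching.

```python
from fnmatch import fnmatchcase


def mime_allowed(content_type, mime_filters):
    """Apply '+mime:' / '-mime:' rules in order; the last match decides."""
    allowed = True  # assumption: queued links are kept unless a rule cancels them
    for rule in mime_filters:
        sign = rule[0]
        pattern = rule.split("mime:", 1)[1]
        if fnmatchcase(content_type, pattern):
            allowed = (sign == "+")
    return allowed


rules = ["-mime:*/*", "+mime:text/html", "+mime:image/*"]
for content_type in ("text/html", "image/png", "application/zip"):
    verdict = "kept" if mime_allowed(content_type, rules) else "cancelled"
    print(content_type, verdict)
```

In this model `text/html` and `image/png` are kept and `application/zip` is cancelled, but only after the server has already answered, which is the inefficiency the warning above describes.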

Common MIME types
Data content description	Typical extensions	MIME type/subtype
HTML text	html htm	text/html
Plain text	txt	text/plain
SGML document	sgml	text/sgml
Cascading Style Sheet	css	text/css
Image: GIF	gif	image/gif
Image: JPEG	jpeg jpg jpe	image/jpeg
Image: Microsoft Bitmap	bmp	image/x-ms-bmp
Image: Portable Network Graphics	png	image/x-png
Image: TIFF	tiff tif	image/tiff
Sound file: Microsoft	wav	audio/x-wav
Sound file: MPEG	mpa abs mpega	audio/x-mpeg
Sound file: MPEG-2	mp2a mpa2	audio/x-mpeg-2
Sound file: Realaudio (Progressive Networks)	ra ram	application/x-pn-realaudio
Sound file: MIDI	mmid	x-music/x-midi
Video: MPEG	mpeg mpg mpe	video/mpeg
Video: MPEG-2	mpv2 mp2v	video/mpeg-2
Video: Macintosh Quicktime	qt mov	video/quicktime
Video: Microsoft	avi	video/x-msvideo
PostScript	ai eps ps	application/postscript
Microsoft Rich Text Format	rtf	application/rtf
Adobe Acrobat (PDF)	pdf	application/pdf application/x-pdf
Java file	jar	application/java-archive
Java class	class	application/java-vm
Compressed file: Gnu tar	gtar	application/x-gtar
Compressed file: BSD4.3 tar	tar	application/x-tar
Compressed file: Zip	zip	application/zip
Javascript program	js ls mocha	text/javascript application/x-javascript
VBScript program		text/vbscript
Perl program	pl	application/x-perl
Macromedia Shockwave		application/x-director
Microsoft PowerPoint presentation	ppz	application/mspowerpoint
Microsoft PowerPoint presentation	ppt	application/vnd.ms-powerpoint
Binary, UUencoded	bin uu	application/octet-stream
PC executable	exe	application/octet-stream
Undefined binary data (usually executable programs)		application/octet-stream

Resources

Fred Cohen's HTTrack User Guide
Spanish User Guide (Fred Cohen's guide updated and translated into Spanish)
