Mechanics of updating

The way HTTrack renames and saves pages locally does not affect updates at all. The original remote hostname, filename and query string are stored in the hts-cache data, and HTTrack uses only that information to perform the update.
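To illustrate the idea (this is a hypothetical sketch, not HTTrack's actual hts-cache format, which is a different binary layout), the update data can be thought of as records keyed by the original remote URL, with the renamed local file and the server's caching hints stored alongside:

```python
import json

# Hypothetical sketch of the idea behind the hts-cache data: records are
# keyed by the ORIGINAL remote URL (hostname, path and query string), so
# renaming files locally never disturbs the update process.
cache = {
    "http://www.example.com/page.php?id=42": {           # original URL, query included
        "local_file": "www.example.com/page-id42.html",  # renamed local copy
        "last_modified": "Wed, 01 Mar 2023 10:00:00 GMT",
        "etag": '"abc123"',
    },
}

# Persist the records between mirror runs (JSON is used here only for
# readability; the real cache format differs).
with open("cache-sketch.json", "w") as f:
    json.dump(cache, f, indent=2)
```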

The majority of the update process depends on the remote server, through two important processes:

* When a file is downloaded for the first time, the server sends a "hint" along with the data: a timestamp (Last-Modified), and/or an ETag.
* During an update, HTTrack requests the previously downloaded file, giving back to the server the "hint" previously sent (timestamp, and/or ETag). It is the duty of the server to either respond with an "OK, file not modified" message (304), or with an "OOPS, you have to redownload this file" message (200). This exchange is sketched in the example below.
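Here is a minimal sketch of that conditional-request exchange, using only Python's standard library; the hostname, path and stored hint values are hypothetical placeholders, not HTTrack's actual code:

```python
import http.client

# Stored "hints" from the previous download of this file (hypothetical values).
saved_last_modified = "Wed, 01 Mar 2023 10:00:00 GMT"
saved_etag = '"abc123"'

conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/page.html", headers={
    "If-Modified-Since": saved_last_modified,  # send the timestamp hint back
    "If-None-Match": saved_etag,               # send the ETag hint back
})
resp = conn.getresponse()

if resp.status == 304:
    # "OK, file not modified": keep the local copy as-is.
    print("Not modified; the local copy is still valid")
elif resp.status == 200:
    # "OOPS, you have to redownload this file": save the fresh data, along
    # with the new hints to send back on the next update.
    body = resp.read()
    new_last_modified = resp.getheader("Last-Modified")
    new_etag = resp.getheader("ETag")
    print("Re-downloaded %d bytes" % len(body))
conn.close()
```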

With this system, the caching process is totally transparent, and very reliable. That's the theory. Now let's go back to the real world.

Some servers, unfortunately, are really dumb; they just ignore the timestamp/ETag, or do not give any reliable information the first time. Because of that, (offline) browsers like HTTrack are forced to re-download data that is identical to the previous version.

Sometimes clever servers are also unable to "cleverly handle" stupid scripts that just don't care about bandwidth waste and caching problems. Many websites (especially those with dynamic pages) are therefore not "cache compliant", and browsers will always re-download their data.

But this is not something a browser can change — only servers could, if only webmasters were concerned about caching problems.

There are always methods that allow pages to be cached, even dynamic ones, and even those that use cookies and other session-related data, as the server-side sketch below illustrates.
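For example, here is a minimal server-side sketch (standalone Python with hypothetical content; a real site would do the equivalent in its own framework) of a dynamic page that stays cache compliant by deriving an ETag from its generated output and honouring If-None-Match:

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch of a cache-compliant dynamic page: the script derives an ETag from
# the content it generates and answers 304 when the client's copy is current,
# instead of ignoring the caching hints entirely.
class CacheFriendlyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html>dynamically generated content</html>"  # placeholder output
        etag = '"%s"' % hashlib.sha1(body).hexdigest()        # strong validator

        if self.headers.get("If-None-Match") == etag:
            # Client already holds this exact version: no body needed.
            self.send_response(304)
            self.send_header("ETag", etag)
            self.end_headers()
            return

        self.send_response(200)
        self.send_header("ETag", etag)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CacheFriendlyHandler).serve_forever()
```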


Information sourced from the HTTrack forum: http://forum.httrack.com/readmsg/3062/index.html