Time |
Nickname |
Message |
10:40
|
underscor |
http://archive.org/about/dmca.php I had no idea this was a thing |
10:40
|
underscor |
\o/ |
16:07
|
godane |
i got the blazetv doc called The Project |
16:59
|
dashcloud |
for wget warc do I need to include a header? |
17:06
|
dashcloud |
apparently you can't use two separate warc headers- everything has to be combined into a single --warc-header command |
17:58
|
alard |
dashcloud: You should be able to use multiple --warc-header options. |
18:00
|
alard |
You could do something like wget --warc-header="operator: Archive Team" --warc-header="x-something-else: value" |
18:01
|
alard |
You can use any header you want, as long as it follows the name: value format. The headers will be stored in the warc-info record at the top of the warc file. |
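A minimal sketch of the point alard makes above: each header gets its own --warc-header option and has to be in name: value form. The helper name warc_header_opt, the --warc-file name and the example.com URL are placeholders for illustration, not anything from the log. The colon check is done by the script itself, not by Wget. The command is only printed here, not run.

```shell
#!/bin/sh
# Hypothetical helper: turn one "name: value" string into a --warc-header option.
# Rejects strings without a colon, since the header must follow the name: value format.
warc_header_opt() {
  case "$1" in
    ?*:*) printf '%s\n' "--warc-header=$1" ;;
    *)    printf '%s\n' "not in name: value format: $1" >&2; return 1 ;;
  esac
}

# One option per header; both end up in the warcinfo record at the top of the WARC.
opt1=$(warc_header_opt "operator: Archive Team") || exit 1
opt2=$(warc_header_opt "x-something-else: value") || exit 1

# Print the resulting command line instead of running it (placeholder file name and URL).
printf '%s\n' "wget --warc-file=example \"$opt1\" \"$opt2\" http://example.com/"
```

Dropping the printf and calling wget with "$opt1" "$opt2" directly would run the actual grab.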
18:03
|
dashcloud |
ah- that explains it |
18:03
|
dashcloud |
I didn't have a colon in the second header command |
18:05
|
dashcloud |
so how much should I set recursion to in order to avoid infinite loops? |
18:40
|
alard |
I'm not sure if Wget checks the headers, it might just copy the strings. |
18:40
|
alard |
Recursion, well, that depends on what you're doing, I guess. |
18:42
|
alard |
It can be lower for very shallow sites, but must be high for sites with a deep structure. You could also try to ignore the looping urls with one of the ignore options. |
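A rough sketch of what that could look like, assuming "ignore options" means Wget's reject/exclude family (--reject-regex is used here; --reject and --exclude-directories are alternatives). The depth of 5, the /calendar/ pattern, the file name and the URL are made-up values for illustration, and the right numbers depend on the site. The command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical values: tune the depth to how deep the site's structure goes.
depth=5
# Hypothetical pattern for URLs that loop forever (e.g. an endless calendar).
loop_pattern='/calendar/'

# Build the argument list: bounded recursion plus a reject rule for the looping URLs.
set -- wget --recursive --level="$depth" \
  --reject-regex="$loop_pattern" \
  --warc-file=example http://example.com/

# Print the command instead of running it.
printf '%s\n' "$*"
```

Replacing the printf with "$@" would run the grab with those settings.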
18:52
|
dashcloud |
thanks |
19:31
|
dashcloud |
hi folks, I did a basic grab of touchatag.com using these settings: http://pastebin.com/nzSnPfz7 and it would be great if someone could double check it- I appear to have missed this page: http://www.touchatag.com/downloads and I'm not quite sure how |
20:00
|
alard |
dashcloud: I'm getting http://www.touchatag.com/downloads , so no idea what's wrong. (You might want to add --page-requisites, but that's something else.) |
21:01
|
dashcloud |
thanks! |