Time |
Nickname |
Message |
07:16
🔗
|
qwebirc29 |
Hey |
07:17
🔗
|
BlueMax |
hi |
07:17
🔗
|
qwebirc29 |
Cool there |
07:17
🔗
|
qwebirc29 |
there's someone here |
07:17
🔗
|
qwebirc29 |
How's it going? |
07:17
🔗
|
qwebirc29 |
Any skilled archivists here? |
07:18
🔗
|
BlueMax |
there's plenty |
07:18
🔗
|
qwebirc29 |
A site I'm on has been taken over, a lot of people have been locked out of their accounts... |
07:18
🔗
|
qwebirc29 |
And the new admin has threatened to wipe their posts numerous times |
07:19
🔗
|
joepie93 |
qwebirc29: what site is this? |
07:19
🔗
|
qwebirc29 |
If anyone has any ideas how to make a backup I'm all ears. Thanks... |
07:20
🔗
|
qwebirc29 |
The site's called www.amkon.net |
07:20
🔗
|
qwebirc29 |
It used to be full of all sorts of renegade research... but now it's in jeopardy |
07:20
🔗
|
qwebirc29 |
I emailed Jason about it a few weeks back and he directed me here |
07:21
🔗
|
joepie93 |
qwebirc29: hmm, I assume that most of it is only accessible to members? |
07:21
🔗
|
qwebirc29 |
There's about a million posts there |
07:21
🔗
|
qwebirc29 |
About half is accessible to members |
07:21
🔗
|
joepie93 |
I see |
07:21
🔗
|
qwebirc29 |
There's actually plenty in the public section. If I could just back up the public side that'd be cool |
07:21
🔗
|
qwebirc29 |
I'm teaching myself how to use Wget. |
07:21
🔗
|
joepie93 |
well, the best method would probably be to wget-warc the site using your own login cookie, however that would mean that your username would be visible on every page |
07:22
🔗
|
joepie93 |
if that's not a problem, then you'll be fine |
07:22
🔗
|
qwebirc29 |
I can do it anonymously publicly |
07:22
🔗
|
qwebirc29 |
They actually gave me permission to make a back up. But they're very erratic. |
07:23
🔗
|
joepie93 |
well, it probably still won't be anonymous - your IP will end up in their logs, and I'm sure they keep IPs on user accounts |
07:23
🔗
|
joepie93 |
but at least the archive won't have your username on every page |
07:23
🔗
|
joepie93 |
:P |
07:23
🔗
|
qwebirc29 |
Do you think I should set bandwidth limits on Wget? |
07:23
🔗
|
joepie93 |
that depends |
07:23
🔗
|
qwebirc29 |
I don't mind if they trace my IP |
07:23
🔗
|
joepie93 |
how fast do you expect it to disappear? |
07:23
🔗
|
qwebirc29 |
Actually I'm just looking for advice on how to download 10,000 threads fairly without swamping their bandwidth |
07:24
🔗
|
joepie93 |
(also, unrelated side-note: "tracing IPs" probably doesn't mean what you think it means :D) |
07:24
🔗
|
joepie93 |
well, you could introduce a pause between downloading pages |
07:24
🔗
|
joepie93 |
using --wait |
07:24
🔗
|
qwebirc29 |
I don't think it'll disappear soon. But I don't know |
07:24
🔗
|
joepie93 |
http://www.archiveteam.org/index.php?title=Wget#Forum_Grab |
07:24
🔗
|
qwebirc29 |
Good idea |
07:24
🔗
|
joepie93 |
that's what I used to grab the team17 forums |
07:24
🔗
|
joepie93 |
which also ran vbulletin |
07:24
🔗
|
qwebirc29 |
How long did it take you to grab that? |
07:25
🔗
|
qwebirc29 |
I appreciate the time you're taking to answer these questions BTW |
07:25
🔗
|
joepie93 |
quite a while, can't recall how long |
07:25
🔗
|
joepie93 |
but expect it to take in the order of hours/days if it's thousands of threads |
07:25
🔗
|
qwebirc29 |
And did the admin notice and IP ban you? |
07:25
🔗
|
joepie93 |
depending on how many pages each thread is |
07:25
🔗
|
joepie93 |
and, can't recall, it's been a while |
07:25
🔗
|
qwebirc29 |
Cool that's helpful Joe. |
07:26
🔗
|
qwebirc29 |
Just one more burning question if you have time... |
07:26
🔗
|
qwebirc29 |
Do the threads come down in Php format? |
07:26
🔗
|
qwebirc29 |
can I mass convert them to MHT later? |
07:26
🔗
|
joepie93 |
if you use the stuff I just linked to, they will be saved as a .warc.gz file |
07:26
🔗
|
joepie93 |
which is a specific archiving format |
07:27
🔗
|
Archivist |
OK |
07:27
🔗
|
joepie93 |
that retains all data about the HTTP requests and such |
07:27
🔗
|
Archivist |
I have a bit of experience with Wget |
07:27
🔗
|
joepie93 |
you can then upload it to archive.org (recommended, so the archive will be accessible to others) |
07:27
🔗
|
Archivist |
And can they be batch converted? |
07:27
🔗
|
joepie93 |
and/or convert it to a zip using this: http://warctozip.archive.org/ |
07:27
🔗
|
joepie93 |
batch converted to what? |
07:27
🔗
|
Archivist |
Oh there are all sorts of legal issues with a public upload |
07:27
🔗
|
Archivist |
Batch convert to MHT |
07:27
🔗
|
joepie93 |
:P |
07:27
🔗
|
joepie93 |
you really don't want MHT |
07:28
🔗
|
joepie93 |
it's a terrible format |
07:28
🔗
|
Archivist |
Actually I just want to make an offline browser |
07:28
🔗
|
joepie93 |
if you just extract the ZIP, you should be able to browse the pages on your local machine |
07:28
🔗
|
Archivist |
MHT seems to retain all the photos and gifs and bells and whistles |
07:28
🔗
|
joepie93 |
anyway, as for the legal issues |
07:28
🔗
|
joepie93 |
just upload it |
07:28
🔗
|
joepie93 |
if archive.org gets complaints, they'll make it inaccessibl |
07:28
🔗
|
joepie93 |
inaccessible * |
07:28
🔗
|
joepie93 |
if they don't, then even better |
07:28
🔗
|
joepie93 |
but to be fair, it's very very hard to raise legal issues against forum archives |
07:28
🔗
|
Archivist |
I don't really want to screw with them right now. They would end up wiping my account |
07:28
🔗
|
joepie93 |
because there are so many contributors |
07:29
🔗
|
Archivist |
Well exactly, right... |
07:29
🔗
|
Archivist |
But they are acting all strange. |
07:29
🔗
|
joepie93 |
hmm.. |
07:29
🔗
|
joepie93 |
not sure who might know this |
07:29
🔗
|
Archivist |
They basically want the power to delete anyone's posts |
07:29
🔗
|
joepie93 |
alard perhaps... does warc.gz retain the IP of the requesting client? |
07:29
🔗
|
joepie93 |
or WARC, rather |
07:29
🔗
|
joepie93 |
Archivist: depending on the answer to that, you could always just send it to me and I can upload it under my archive.org account |
07:30
🔗
|
joepie93 |
and if the IP isn't kept in the WARC it's not tied to you |
07:30
🔗
|
joepie93 |
but I'm not sure whether WARC retains that data or not |
07:30
🔗
|
joepie93 |
<Archivist>MHT seems to retain all the photos and gifs and bells and whistles |
07:30
🔗
|
joepie93 |
also |
07:30
🔗
|
joepie93 |
MHT is just a container format |
07:30
🔗
|
Archivist |
OK I get it |
07:31
🔗
|
joepie93 |
any archiving tool worth its salt should do that |
07:31
🔗
|
joepie93 |
as does wget-warc, assuming you have page-requisites turned on |
07:31
🔗
|
Archivist |
OK cool I'm starting to get it |
07:31
🔗
|
joepie93 |
hmm |
07:31
🔗
|
joepie93 |
moment |
07:31
🔗
|
Archivist |
page requisites are all the extras, right? |
07:32
🔗
|
joepie93 |
wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon" |
07:32
🔗
|
joepie93 |
that's what you'll want to do |
07:32
🔗
|
joepie93 |
I think |
07:32
🔗
|
joepie93 |
yes |
07:32
🔗
|
joepie93 |
page-requisites are assets that are linked from the page as images, stylesheets, etc. |
07:32
🔗
|
Archivist |
Nice, cheers |
07:33
🔗
|
Archivist |
You guys are doing a valuable service |
07:33
🔗
|
joepie93 |
so yeah, summary: |
07:33
🔗
|
joepie93 |
1. wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon" |
07:33
🔗
|
Archivist |
Cool.... |
07:33
🔗
|
joepie93 |
2. if WARC doesn't retain IPs (which someone else should elaborate on), you can send the resulting warc.gz to me and I can upload it to IA for you |
07:33
🔗
|
Archivist |
Umm. another thing.. please tell me if that's too many questions |
07:33
🔗
|
joepie93 |
3. convert to ZIP using http://warctozip.archive.org/ |
07:33
🔗
|
joepie93 |
and you have a local copy |
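For readers who would rather script step 1 than type it, here is a minimal sketch that assembles the same wget invocation as an argument list. The function name `build_wget_command` and its parameters are made up for illustration; the flags and values are exactly the ones quoted in the summary above. The sketch only prints the command and does not run it.

```python
import shlex


def build_wget_command(url, warc_name, wait_seconds=0.25):
    """Return the argument list for the mirror command quoted in the summary.

    build_wget_command is a hypothetical helper; the flags themselves
    are copied from the chat.
    """
    return [
        "wget",
        "-e", "robots=off",              # ignore robots.txt, as in the chat
        "--wait", str(wait_seconds),     # pause between retrievals
        url,
        "--mirror",                      # recursive download with timestamping
        "--page-requisites",             # also fetch images, stylesheets, etc.
        f"--warc-file={warc_name}",      # write a WARC alongside the mirror
    ]


command = build_wget_command("http://amkon.net/", "at-amkon")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the crawl; that call is left out here on purpose so the sketch has no side effects.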
07:33
🔗
|
joepie93 |
and no, just ask away :P |
07:33
🔗
|
Archivist |
Can I start one night, then break and do it again another night? |
07:34
🔗
|
joepie93 |
I don't *think* it's possible (when using warc at least), but someone more familiar with wget may contradict me on that |
07:34
🔗
|
joepie93 |
so I'm not sur |
07:34
🔗
|
joepie93 |
sure * |
07:34
🔗
|
BlueMax |
is the site so big as to require downloading over one night? |
07:35
🔗
|
joepie93 |
BlueMax: idk, it's a vbulletin forum, those are usually pretty noisy wrt different URLs and URL formats |
07:35
🔗
|
joepie93 |
and using a --wait that may rack up archiving time quickly |
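To put a rough number on how quickly --wait adds up, here is a back-of-the-envelope sketch. The 10,000 threads figure comes from the chat; the number of URLs per thread and the per-request fetch time are pure assumptions, and `estimate_crawl_hours` is a made-up helper name. It treats the crawl as strictly sequential, so it is a lower bound at best.

```python
def estimate_crawl_hours(url_count, wait_seconds, fetch_seconds=0.0):
    """Lower-bound wall-clock hours for a sequential crawl.

    Each URL costs the configured wait plus an assumed fetch time.
    """
    return url_count * (wait_seconds + fetch_seconds) / 3600


# 10,000 threads (from the chat) times 5 URLs per thread (assumed),
# with --wait 0.25 and an assumed 0.5 s per request.
hours = estimate_crawl_hours(10_000 * 5, wait_seconds=0.25, fetch_seconds=0.5)
print(f"roughly {hours:.1f} hours")
```

vBulletin tends to expose the same thread under several URL forms, so the real URL count may be well above the assumed one.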
07:36
🔗
|
joepie93 |
and BlueMax, perhaps you can answer that: does wget-warc keep the requesting IP address in the resulting warc.gz? |
07:36
🔗
|
joepie93 |
ie., is it possible to identify the IP of the archivist from the archive |
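One way to check this yourself is to look at the header fields of the finished file. The sketch below lists the distinct values of one named header in a .warc.gz; the helper names are made up for illustration. As far as the WARC specification goes, the `WARC-IP-Address` field records the address that was contacted to fetch the content (the server), not the requester, but it is still worth reading the `warcinfo` record by hand before sharing a file.

```python
import gzip


def header_values(lines, field_name):
    """Collect distinct values of one header field from an iterable of byte lines.

    This is a naive line scan: it does not parse WARC records, so a payload
    line that happens to start with the field name would also be matched.
    """
    prefix = field_name.lower().encode("ascii") + b":"
    values = set()
    for line in lines:
        if line.lower().startswith(prefix):
            values.add(line[len(prefix):].strip().decode("utf-8", "replace"))
    return sorted(values)


def warc_header_values(path, field_name):
    """Run header_values over a (possibly multi-member) .warc.gz file."""
    with gzip.open(path, "rb") as stream:
        return header_values(stream, field_name)


# Example call, assuming the crawl above produced at-amkon.warc.gz:
# print(warc_header_values("at-amkon.warc.gz", "WARC-IP-Address"))
```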
07:36
🔗
|
BlueMax |
I honestly have no idea lol I was referring to the amount of time to download the website |
07:36
🔗
|
BlueMax |
Didn't know keeping IP addresses was important |
07:37
🔗
|
joepie93 |
BlueMax: they're two separate topics |
07:37
🔗
|
BlueMax |
oh |
07:37
🔗
|
joepie93 |
"it's a vBulletin forum so archiving might take a while because it's inconsistent in URL format and --wait is used" |
07:37
🔗
|
joepie93 |
and |
07:37
🔗
|
joepie93 |
"also, does wget-warc keep requester IPs?" |
07:37
🔗
|
joepie93 |
:P |
07:37
🔗
|
brayden |
pshh... |
07:37
🔗
|
* |
brayden uses urllib |
07:38
🔗
|
joepie93 |
brayden: bah |
07:38
🔗
|
joepie93 |
then at least use requests |
07:38
🔗
|
* |
joepie93 thinks requests should be in stdlib |
07:38
🔗
|
Archivist |
I'm reading what you wrote now |
07:39
🔗
|
brayden |
Yeah.. have to agree with the introduction they have. It is such a pain in the ass to do cookies and user-agents on urllib :( |
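For context on the complaint, this is roughly what cookies plus a custom user-agent look like with only the standard library (the Python 3 module names are used here; the chat may well have been about Python 2's urllib/urllib2). The helper name and the user-agent string are made up for illustration, and no request is sent.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies(user_agent):
    """Return an opener that stores cookies and sends a custom User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)  # keeps cookies between requests
    )
    opener.addheaders = [("User-Agent", user_agent)]  # replaces the default header
    return opener, jar


opener, jar = build_opener_with_cookies("example-archiver/0.1")
# opener.open("http://example.com/") would now send the header above
# and store any cookies the server sets in jar.
```

With requests, a `Session` object covers both concerns, which is presumably the convenience being praised here.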
07:39
🔗
|
brayden |
I'm tempted though to use Tornado instead for their async HTTP client |
07:39
🔗
|
brayden |
http://www.tornadoweb.org/en/stable/httpclient.html |
07:40
🔗
|
joepie93 |
brayden: mmm... async is the only thing I think requests doesn't have |
07:40
🔗
|
joepie93 |
oh, also, if you need to do requests from a specific interface, I have a patch for that |
07:40
🔗
|
joepie93 |
for requests |
07:40
🔗
|
brayden |
tornado reckons it can do that if you enable use of pycurl |
07:41
🔗
|
brayden |
never had to though, I don't have more than one :( |
07:41
🔗
|
Archivist |
Sorry, is Tornado related to the Amkon back up or is it a different topic? |
07:41
🔗
|
godane |
does anyone do backups of digital planet/click podcast on the bbc? |
07:41
🔗
|
brayden |
not in the very slightest unless you want an asynchronous HTTP client to backup Amkon? |
07:41
🔗
|
joepie93 |
Archivist: nah, unrelated |
07:41
🔗
|
joepie93 |
brayden: https://gist.github.com/joepie91/5896273 |
07:41
🔗
|
Archivist |
OK cool. |
07:41
🔗
|
joepie93 |
technically it's pick your own IP, not pick your own interface... but hey! |
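The gist itself is not reproduced here. As a loose standard-library analogue of "pick your own IP", `http.client.HTTPConnection` accepts a `source_address` tuple that the socket is bound to before connecting. The address below is a documentation-range placeholder, not a real local address, and the sketch only constructs the connection object without opening it.

```python
import http.client

# (host, port) to bind the outgoing socket to; port 0 lets the OS choose.
# 192.0.2.10 is a placeholder and must be replaced by an address that is
# actually configured on the local machine.
LOCAL_ADDRESS = ("192.0.2.10", 0)

conn = http.client.HTTPConnection(
    "example.com", 80, timeout=10, source_address=LOCAL_ADDRESS
)

# Nothing is sent until a request is made, e.g.:
# conn.request("GET", "/")
# response = conn.getresponse()
print(conn.host, conn.port, conn.source_address)
```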
07:42
🔗
|
Archivist |
Thanks for answering all those questions, I have a fair idea where to start now |
07:42
🔗
|
godane |
cause that show deletes episodes after 30 days and the Wayback Machine doesn't even have a good archive for 2013 |
07:42
🔗
|
joepie93 |
though I think we're moving into -bs material here |
07:42
🔗
|
* |
brayden runs away to #archiveteam-bs |
07:42
🔗
|
joepie93 |
Archivist: np, if you have any further questions, feel free to ask |
07:45
🔗
|
qwebirc29 |
Cheers all. Have a good Sunday! |
07:46
🔗
|
BlueMax |
well wasn't he a nice fellow |
08:13
🔗
|
yipdw |
oh |
08:13
🔗
|
yipdw |
I should have let him know that amkon.net is far too big for wget to just grab, unless he has a machine with gobs of RAM |
19:19
🔗
|
bsmith093 |
is urlteam down or something? |
20:36
🔗
|
omf_ |
From twitter http://zapd.com/ is closing in 1 week |
20:38
🔗
|
omf_ |
wget won't work on that site because all the image content is loaded via javascript |
21:23
🔗
|
chfoo |
i created the wiki page for zapd: http://archiveteam.org/index.php?title=Zapd |
22:07
🔗
|
* |
robink is having issues mirroring a site with wget |
23:05
🔗
|
rigel |
hi |
23:05
🔗
|
rigel |
so i downloaded the warrior image |
23:05
🔗
|
rigel |
and it creates a second HD that is 60gb in size |
23:05
🔗
|
rigel |
i dont have nearly that much free space |
23:05
🔗
|
rigel |
do i need it to be that large? |
23:10
🔗
|
chfoo |
rigel: i think that's the recommended size to avoid unexpectedly running out of disk space |
23:11
🔗
|
chfoo |
but typically, the second disk image fills up to about 25 GB for me |
23:13
🔗
|
rigel |
i see |
23:13
🔗
|
rigel |
well, sorry i couldn't be of help |
23:15
🔗
|
chfoo |
:( |
23:19
🔗
|
xmc |
does the warrior have TRIM? |
23:20
🔗
|
xmc |
virtualbox properly trims vhds, apparently |