[07:16] Hey
[07:17] hi
[07:17] Cool there
[07:17] there's someone here
[07:17] How's it going?
[07:17] Any skilled archivists here?
[07:18] there's plenty
[07:18] A site I'm on has been taken over, a lot of people have been locked out of their accounts...
[07:18] And the new admin has threatened to wipe their posts numerous times
[07:19] qwebirc29: what site is this?
[07:19] If anyone has any ideas how to make a backup I'm all ears. Thanks...
[07:20] The site's called www.amkon.net
[07:20] It used to be full of all sorts of renegade research... but now it's in jeopardy
[07:20] I emailed Jason about it a few weeks back and he directed me here
[07:21] qwebirc29: hmm, I assume that most of it is only accessible to members?
[07:21] There's about a million posts there
[07:21] About half is accessible to members
[07:21] I see
[07:21] There's actually plenty in the public section. If I could just back up the public side that'd be cool
[07:21] I'm teaching myself how to use Wget.
[07:21] well, the best method would probably be to wget-warc the site using your own login cookie, however that would mean that your username would be visible on every page
[07:22] if that's not a problem, then you'll be fine
[07:22] I can do it anonymously publicly
[07:22] They actually gave me permission to make a backup. But they're very erratic.
[07:23] well, it probably still won't be anonymous - your IP will end up in their logs, and I'm sure they keep IPs on user accounts
[07:23] but at least the archive won't have your username on every page
[07:23] :P
[07:23] Do you think I should set bandwidth limits on Wget?
[07:23] that depends
[07:23] I don't mind if they trace my IP
[07:23] how fast do you expect it to disappear?
[07:23] Actually I'm just looking for advice on how to download 10,000 threads fairly without swamping their bandwidth
[07:24] (also, unrelated side-note: "tracing IPs" probably doesn't mean what you think it means :D)
[07:24] well, you could introduce a pause between downloading pages
[07:24] using --wait
[07:24] I don't think it'll disappear soon. But I don't know
[07:24] http://www.archiveteam.org/index.php?title=Wget#Forum_Grab
[07:24] Good idea
[07:24] that's what I used to grab the team17 forums
[07:24] which also ran vBulletin
[07:24] How long did it take you to grab that?
[07:25] I appreciate the time you're taking to answer these questions BTW
[07:25] quite a while, can't recall how long
[07:25] but expect it to take in the order of hours/days if it's thousands of threads
[07:25] And did the admin notice and IP ban you?
[07:25] depending on how many pages each thread is
[07:25] and, can't recall, it's been a while
[07:25] Cool that's helpful Joe.
[07:26] Just one more burning question if you have time...
[07:26] Do the threads come down in PHP format?
[07:26] can I mass convert them to MHT later?
[07:26] if you use the stuff I just linked to, they will be saved as a .warc.gz file
[07:26] which is a specific archiving format
[07:27] OK
[07:27] that retains all data about the HTTP requests and such
[07:27] I have a bit of experience with Wget
[07:27] you can then upload it to archive.org (recommended, so the archive will be accessible to others)
[07:27] And can they be batch converted?
[07:27] and/or convert it to a zip using this: http://warctozip.archive.org/
[07:27] batch converted to what?
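A rough sketch of the logged-in wget-warc grab described above, in case it helps. The cookies.txt path, output name and exact flag choices here are assumptions (WARC output needs wget 1.14 or newer); the idea is simply to reuse the browser's session cookie and pause between requests:

    # Hypothetical invocation for a logged-in, rate-limited grab of the forum,
    # assuming the browser's login cookie has been exported to cookies.txt.
    wget -e robots=off --mirror --page-requisites \
         --wait 1 --random-wait \
         --load-cookies cookies.txt \
         --warc-file="amkon-members" \
         "http://www.amkon.net/"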
[07:27] Oh there are all sorts of legal issues with a public upload
[07:27] Batch convert to MHT
[07:27] :P
[07:27] you really don't want MHT
[07:28] it's a terrible format
[07:28] Actually I just want to make an offline browser
[07:28] if you just extract the ZIP, you should be able to browse the pages on your local machine
[07:28] MHT seems to retain all the photos and gifs and bells and whistles
[07:28] anyway, as for the legal issues
[07:28] just upload it
[07:28] if archive.org gets complaints, they'll make it inaccessibl
[07:28] inaccessible *
[07:28] if they don't, then even better
[07:28] but to be fair, it's very very hard to raise legal issues against forum archives
[07:28] I don't really want to screw with them right now. They would end up wiping my account
[07:28] because there are so many contributors
[07:29] Well exactly, right...
[07:29] But they are acting all strange.
[07:29] hmm..
[07:29] not sure who might know this
[07:29] They basically want the power to delete anyone's posts
[07:29] alard perhaps... does warc.gz retain the IP of the requesting client?
[07:29] or WARC, rather
[07:29] Archivist: depending on the answer to that, you could always just send it to me and I can upload it under my archive.org account
[07:30] and if the IP isn't kept in the WARC it's not tied to you
[07:30] but I'm not sure whether WARC retains that data or not
[07:30] MHT seems to retain all the photos and gifs and bells and whistles
[07:30] also
[07:30] MHT is just a container format
[07:30] OK I get it
[07:31] any archiving tool worth its salt should do that
[07:31] as does wget-warc, assuming you have page-requisites turned on
[07:31] OK cool I'm starting to get it
[07:31] hmm
[07:31] moment
[07:31] page requisites are all the extras, right?
[07:32] wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
[07:32] that's what you'll want to do
[07:32] I think
[07:32] yes
[07:32] page-requisites are assets that are linked from the page as images, stylesheets, etc.
[07:32] Nice, cheers
[07:33] You guys are doing a valuable service
[07:33] so yeah, summary:
[07:33] 1. wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
[07:33] Cool....
[07:33] 2. if WARC doesn't retain IPs (which someone else should elaborate on), you can send the resulting warc.gz to me and I can upload it to IA for you
[07:33] Umm. another thing.. please tell me if that's too many questions
[07:33] 3. convert to ZIP using http://warctozip.archive.org/
[07:33] and you have a local copy
[07:33] and no, just ask away :P
[07:33] Can I start one night, then break and do it again another night?
[07:34] I don't *think* it's possible (when using warc at least), but someone more familiar with wget may contradict me on that
[07:34] so I'm not sur
[07:34] sure *
[07:34] is the site so big as to require downloading over one night?
[07:35] BlueMax: idk, it's a vBulletin forum, those are usually pretty noisy wrt different URLs and URL formats
[07:35] and using a --wait that may rack up archiving time quickly
[07:36] and BlueMax, perhaps you can answer that: does wget-warc keep the requesting IP address in the resulting warc.gz?
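One way to settle the IP question empirically rather than from memory is to grep the finished archive for your own address before sending it anywhere. A minimal sketch, assuming the command above produced at-amkon.warc.gz and with 203.0.113.7 standing in for your actual public IP:

    # zgrep searches the gzipped WARC directly; -a treats it as text, -c counts matches.
    # A result of 0 means your IP does not appear anywhere in the archive.
    zgrep -a -c "203.0.113.7" at-amkon.warc.gz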
[07:36] i.e., is it possible to identify the IP of the archivist from the archive
[07:36] I honestly have no idea lol I was referring to the amount of time to download the website
[07:36] Didn't know keeping IP addresses was important
[07:37] BlueMax: they're two separate topics
[07:37] oh
[07:37] "it's a vBulletin forum so archiving might take a while because it's inconsistent in URL format and --wait is used"
[07:37] and
[07:37] "also, does wget-warc keep requester IPs?"
[07:37] :P
[07:37] pshh...
[07:37] * brayden uses urllib
[07:38] brayden: bah
[07:38] then at least use requests
[07:38] * joepie93 thinks requests should be in stdlib
[07:38] I'm reading what you wrote now
[07:39] Yeah.. have to agree with the introduction they have. It is such a pain in the ass to do cookies and user-agents on urllib :(
[07:39] I'm tempted though to use Tornado instead for their async HTTP client
[07:39] http://www.tornadoweb.org/en/stable/httpclient.html
[07:40] brayden: mmm... async is the only thing I think requests doesn't have
[07:40] oh, also, if you need to do requests from a specific interface, I have a patch for that
[07:40] for requests
[07:40] tornado reckons it can do that if you enable use of pycurl
[07:41] never had to though, I don't have more than one :(
[07:41] Sorry, is Tornado related to the Amkon backup or is it a different topic?
[07:41] does anyone do backups of the Digital Planet/Click podcast on the BBC?
[07:41] not in the very slightest unless you want an asynchronous HTTP client to back up Amkon?
[07:41] Archivist: nah, unrelated
[07:41] brayden: https://gist.github.com/joepie91/5896273
[07:41] OK cool.
[07:41] technically it's pick your own IP, not pick your own interface... but hey!
[07:42] Thanks for answering all those questions, I have a fair idea where to start now
[07:42] cause that show deletes episodes after 30 days and the Wayback Machine doesn't even have a good archive for 2013
[07:42] though I think we're moving into -bs material here
[07:42] * brayden runs away to #archiveteam-bs
[07:42] Archivist: np, if you have any further questions, feel free to ask
[07:45] Cheers all. Have a good Sunday!
[07:46] well wasn't he a nice fellow
[08:13] oh
[08:13] I should have let him know that amkon.net is far too big for wget to just grab, unless he has a machine with gobs of RAM
[19:19] is urlteam down or something?
[20:36] From Twitter: http://zapd.com/ is closing in 1 week
[20:38] wget won't work on that site because all the image content is loaded via JavaScript
[21:23] i created the wiki page for zapd: http://archiveteam.org/index.php?title=Zapd
[22:07] * robink is having issues mirroring a site with wget
[23:05] hi
[23:05] so i downloaded the warrior image
[23:05] and it creates a second HD that is 60GB in size
[23:05] i don't have nearly that much free space
[23:05] do i need it to be that large?
[23:10] rigel: i think that's the recommended size to avoid unexpectedly running out of disk space
[23:11] but typically, the second disk image only gets filled to about 25GB for me
[23:13] i see
[23:13] well, sorry i couldn't be of help
[23:15] :(
[23:19] does the warrior have TRIM?
[23:20] VirtualBox properly trims VHDs, apparently
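On the 60GB worry: if the warrior's second disk is a dynamically allocated image (which the "fills to about 25GB" remark suggests, though that is an assumption), the 60GB is a ceiling rather than space taken up front. A quick way to check on the host machine; the file name and extension are guesses, so adjust them to whatever VirtualBox actually created:

    # Compare the apparent (maximum) size with the blocks actually allocated on disk.
    ls -lh archiveteam-warrior-data.vdi   # logical size of the image file
    du -h  archiveteam-warrior-data.vdi   # space it really occupies right now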