#archiveteam 2013-09-29,Sun


Time Nickname Message
07:16 🔗 qwebirc29 Hey
07:17 🔗 BlueMax hi
07:17 🔗 qwebirc29 Cool, there's someone here
07:17 🔗 qwebirc29 How's it going?
07:17 🔗 qwebirc29 Any skilled archivists here?
07:18 🔗 BlueMax there's plenty
07:18 🔗 qwebirc29 A site I'm on has been taken over, a lot of people have been locked out of their accounts...
07:18 🔗 qwebirc29 And the new admin has threatened to wipe their posts numerous times
07:19 🔗 joepie93 qwebirc29: what site is this?
07:19 🔗 qwebirc29 If anyone has any ideas how to make a backup I'm all ears. Thanks...
07:20 🔗 qwebirc29 The site's called www.amkon.net
07:20 🔗 qwebirc29 It used to be full of all sorts of renegade research... but now it's in jeopardy
07:20 🔗 qwebirc29 I emailed Jason about it a few weeks back and he directed me here
07:21 🔗 joepie93 qwebirc29: hmm, I assume that most of it is only accessible to members?
07:21 🔗 qwebirc29 There's about a million posts there
07:21 🔗 qwebirc29 About half is accessible to members
07:21 🔗 joepie93 I see
07:21 🔗 qwebirc29 There's actually plenty in the public section. If I could just back up the public side that'd be cool
07:21 🔗 qwebirc29 I'm teaching myself how to use Wget.
07:21 🔗 joepie93 well, the best method would probably be to wget-warc the site using your own login cookie, however that would mean that your username would be visible on every page
07:22 🔗 joepie93 if that's not a problem, then you'll be fine
07:22 🔗 qwebirc29 I can do it anonymously publicly
07:22 🔗 qwebirc29 They actually gave me permission to make a backup. But they're very erratic.
07:23 🔗 joepie93 well, it probably still won't be anonymous - your IP will end up in their logs, and I'm sure they keep IPs on user accounts
07:23 🔗 joepie93 but at least the archive won't have your username on every page
07:23 🔗 joepie93 :P
07:23 🔗 qwebirc29 Do you think I should set bandwidth limits on Wget?
07:23 🔗 joepie93 that depends
07:23 🔗 qwebirc29 I don't mind if they trace my IP
07:23 🔗 joepie93 how fast do you expect it to disappear?
07:23 🔗 qwebirc29 Actually I'm just looking for advice on how to download 10,000 threads fairly, without swamping their bandwidth
07:24 🔗 joepie93 (also, unrelated side-note: "tracing IPs" probably doesn't mean what you think it means :D)
07:24 🔗 joepie93 well, you could introduce a pause between downloading pages
07:24 🔗 joepie93 using --wait
07:24 🔗 qwebirc29 I don't think it'll disappear soon. But I don't know
07:24 🔗 joepie93 http://www.archiveteam.org/index.php?title=Wget#Forum_Grab
07:24 🔗 qwebirc29 Good idea
07:24 🔗 joepie93 that's what I used to grab the team17 forums
07:24 🔗 joepie93 which also ran vbulletin
07:24 🔗 qwebirc29 How long did it take you to grab that?
07:25 🔗 qwebirc29 I appreciate the time you're taking to answer these questions BTW
07:25 🔗 joepie93 quite a while, can't recall how long
07:25 🔗 joepie93 but expect it to take in the order of hours/days if it's thousands of threads
07:25 🔗 qwebirc29 And did the admin notice and IP ban you?
07:25 🔗 joepie93 depending on how many pages each thread is
07:25 🔗 joepie93 and, can't recall, it's been a while
07:25 🔗 qwebirc29 Cool that's helpful Joe.
07:26 🔗 qwebirc29 Just one more burning question if you have time...
07:26 🔗 qwebirc29 Do the threads come down in PHP format?
07:26 🔗 qwebirc29 can I mass convert them to MHT later?
07:26 🔗 joepie93 if you use the stuff I just linked to, they will be saved as a .warc.gz file
07:26 🔗 joepie93 which is a specific archiving format
07:27 🔗 Archivist OK
07:27 🔗 joepie93 that retains all data about the HTTP requests and such
07:27 🔗 Archivist I have a bit of experience with Wget
07:27 🔗 joepie93 you can then upload it to archive.org (recommended, so the archive will be accessible to others)
07:27 🔗 Archivist And can they be batch converted?
07:27 🔗 joepie93 and/or convert it to a zip using this: http://warctozip.archive.org/
07:27 🔗 joepie93 batch converted to what?
07:27 🔗 Archivist Oh there are all sorts of legal issues with a public upload
07:27 🔗 Archivist Batch convert to MHT
07:27 🔗 joepie93 :P
07:27 🔗 joepie93 you really don't want MHT
07:28 🔗 joepie93 it's a terrible format
07:28 🔗 Archivist Actually I just want to make an offline browser
07:28 🔗 joepie93 if you just extract the ZIP, you should be able to browse the pages on your local machine
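
A minimal sketch of that workflow, assuming the archive converted by warctozip is saved as at-amkon.zip (a hypothetical filename):

    unzip at-amkon.zip -d amkon-local
    # pages can then be opened straight from disk,
    # e.g. amkon-local/amkon.net/index.html in any browser
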
07:28 🔗 Archivist MHT seems to retain all the photos and gifs and bells and whistles
07:28 🔗 joepie93 anyway, as for the legal issues
07:28 🔗 joepie93 just upload it
07:28 🔗 joepie93 if archive.org gets complaints, they'll make it inaccessible
07:28 🔗 joepie93 if they don't, then even better
07:28 🔗 joepie93 but to be fair, it's very very hard to raise legal issues against forum archives
07:28 🔗 Archivist I don't really want to screw with them right now. They would end up wiping my account
07:28 🔗 joepie93 because there are so many contributors
07:29 🔗 Archivist Well exactly, right...
07:29 🔗 Archivist But they are acting all strange.
07:29 🔗 joepie93 hmm..
07:29 🔗 joepie93 not sure who might know this
07:29 🔗 Archivist They basically want the power to delete anyone's posts
07:29 🔗 joepie93 alard perhaps... does warc.gz retain the IP of the requesting client?
07:29 🔗 joepie93 or WARC, rather
07:29 🔗 joepie93 Archivist: depending on the answer to that, you could always just send it to me and I can upload it under my archive.org account
07:30 🔗 joepie93 and if the IP isn't kept in the WARC it's not tied to you
07:30 🔗 joepie93 but I'm not sure whether WARC retains that data or not
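
One way to settle this is to inspect the capture itself. A sketch using the warcio Python library (an assumption; any WARC reader would do), with at-amkon.warc.gz as the hypothetical output file; per the WARC spec, the WARC-IP-Address header records the address of the server that was contacted, not the client's:

    from warcio.archiveiterator import ArchiveIterator

    # print every record's type and headers so any stored
    # addresses are visible for manual inspection
    with open("at-amkon.warc.gz", "rb") as f:
        for record in ArchiveIterator(f):
            print(record.rec_type)
            print(record.rec_headers)
            print("---")
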
07:30 🔗 joepie93 <Archivist>MHT seems to retain all the photos and gifs and bells and whistles
07:30 🔗 joepie93 also
07:30 🔗 joepie93 MHT is just a container format
07:30 🔗 Archivist OK I get it
07:31 🔗 joepie93 any archiving tool worth its salt should do that
07:31 🔗 joepie93 as does wget-warc, assuming you have page-requisites turned on
07:31 🔗 Archivist OK cool, I'm starting to get it
07:31 🔗 joepie93 hmm
07:31 🔗 joepie93 moment
07:31 🔗 Archivist page requisites are all the extras, right?
07:32 🔗 joepie93 wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
07:32 🔗 joepie93 that's what you'll want to do
07:32 🔗 joepie93 I think
07:32 🔗 joepie93 yes
07:32 🔗 joepie93 page-requisites are assets that are linked from the page as images, stylesheets, etc.
07:32 🔗 Archivist Nice, cheers
07:33 🔗 Archivist You guys are doing a valuable service
07:33 🔗 joepie93 so yeah, summary:
07:33 🔗 joepie93 1. wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
07:33 🔗 Archivist Cool....
07:33 🔗 joepie93 2. if WARC doesn't retain IPs (which someone else should elaborate on), you can send the resulting warc.gz to me and I can upload it to IA for you
07:33 🔗 Archivist Umm, another thing... please tell me if that's too many questions
07:33 🔗 joepie93 3. convert to ZIP using http://warctozip.archive.org/
07:33 🔗 joepie93 and you have a local copy
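
A possible variant of step 1, picking up the earlier bandwidth question; --limit-rate and --user-agent are standard wget options, and the values shown here are only suggestions:

    wget -e robots=off --wait 0.25 --limit-rate=200k \
         --user-agent="amkon-archiver (contact: you@example.com)" \
         --mirror --page-requisites --warc-file="at-amkon" \
         "http://amkon.net/"
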
07:33 🔗 joepie93 and no, just ask away :P
07:33 🔗 Archivist Can I start one night, then break and do it again another night?
07:34 🔗 joepie93 I don't *think* it's possible (when using warc at least), but someone more familiar with wget may contradict me on that
07:34 🔗 joepie93 so I'm not sure
07:34 🔗 BlueMax is the site so big as to require downloading over one night?
07:35 🔗 joepie93 BlueMax: idk, it's a vbulletin forum, those are usually pretty noisy wrt different URLs and URL formats
07:35 🔗 joepie93 and using a --wait that may rack up archiving time quickly
07:36 🔗 joepie93 and BlueMax, perhaps you can answer that: does wget-warc keep the requesting IP address in the resulting warc.gz?
07:36 🔗 joepie93 i.e., is it possible to identify the IP of the archivist from the archive
07:36 🔗 BlueMax I honestly have no idea lol I was referring to the amount of time to download the website
07:36 🔗 BlueMax Didn't know keeping IP addresses was important
07:37 🔗 joepie93 BlueMax: they're two separate topics
07:37 🔗 BlueMax oh
07:37 🔗 joepie93 "it's a vBulletin forum so archiving might take a while because it's inconsistent in URL format and --wait is used"
07:37 🔗 joepie93 and
07:37 🔗 joepie93 "also, does wget-warc keep requester IPs?"
07:37 🔗 joepie93 :P
07:37 🔗 brayden pshh...
07:37 🔗 * brayden uses urllib
07:38 🔗 joepie93 brayden: bah
07:38 🔗 joepie93 then at least use requests
07:38 🔗 * joepie93 thinks requests should be in stdlib
07:38 🔗 Archivist I'm reading what you wrote now
07:39 🔗 brayden Yeah.. have to agree with the introduction they have. It is such a pain in the ass to do cookies and user-agents on urllib :(
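
For comparison, a minimal sketch of the same thing in requests; the cookie name and values are placeholders:

    import requests

    # cookies and a custom user-agent are keyword arguments in requests,
    # versus building opener/handler objects in urllib2
    response = requests.get(
        "http://amkon.net/",
        headers={"User-Agent": "my-archiver/0.1"},
        cookies={"bbsessionhash": "placeholder-session-hash"},
    )
    print(response.status_code, len(response.text))
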
07:39 🔗 brayden I'm tempted though to use Tornado instead for their async HTTP client
07:39 🔗 brayden http://www.tornadoweb.org/en/stable/httpclient.html
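
A minimal sketch of that client, using the callback style Tornado offered at the time:

    from tornado import httpclient, ioloop

    def handle_response(response):
        # runs once the fetch completes, then stops the event loop
        print(response.code, len(response.body))
        ioloop.IOLoop.instance().stop()

    client = httpclient.AsyncHTTPClient()
    client.fetch("http://amkon.net/", handle_response)
    ioloop.IOLoop.instance().start()
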
07:40 🔗 joepie93 brayden: mmm... async is the only thing I think requests doesn't have
07:40 🔗 joepie93 oh, also, if you need to do requests from a specific interface, I have a patch for that
07:40 🔗 joepie93 for requests
07:40 🔗 brayden tornado reckons it can do that if you enable use of pycurl
07:41 🔗 brayden never had to though, I don't have more than one :(
07:41 🔗 Archivist Sorry, is Tornado related to the Amkon backup or is it a different topic?
07:41 🔗 godane does anyone do backups of digital planet/click podcast on the bbc?
07:41 🔗 brayden not in the very slightest, unless you want an asynchronous HTTP client to back up Amkon?
07:41 🔗 joepie93 Archivist: nah, unrelated
07:41 🔗 joepie93 brayden: https://gist.github.com/joepie91/5896273
07:41 🔗 Archivist OK cool.
07:41 🔗 joepie93 technically it's pick your own IP, not pick your own interface... but hey!
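
What the patch boils down to at the socket level; a stdlib-only sketch, where 192.0.2.10 stands in for a local address on a multi-homed host:

    import socket

    # bind outgoing traffic to a chosen local IP (port 0 = any free port);
    # roughly the mechanism the requests patch above wires in internally
    conn = socket.create_connection(("example.com", 80),
                                    source_address=("192.0.2.10", 0))
    conn.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    print(conn.recv(4096))
    conn.close()
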
07:42 🔗 Archivist Thanks for answering all those questions, I have a fair idea where to start now
07:42 🔗 godane cause that show deletes episodes after 30 days and the Wayback Machine doesn't even have a good archive for 2013
07:42 🔗 joepie93 though I think we're moving into -bs material here
07:42 🔗 * brayden runs away to #archiveteam-bs
07:42 🔗 joepie93 Archivist: np, if you have any further questions, feel free to ask
07:45 🔗 qwebirc29 Cheers all. Have a good Sunday!
07:46 🔗 BlueMax well wasn't he a nice fellow
08:13 🔗 yipdw oh
08:13 🔗 yipdw I should have let him know that amkon.net is far too big for wget to just grab, unless he has a machine with gobs of RAM
19:19 🔗 bsmith093 is urlteam down or something?
20:36 🔗 omf_ From twitter http://zapd.com/ is closing in 1 week
20:38 🔗 omf_ wget won't work on that site because all the image content is loaded via javascript
21:23 🔗 chfoo i created the wiki page for zapd: http://archiveteam.org/index.php?title=Zapd
22:07 🔗 * robink is having issues mirroring a site with wget
23:05 🔗 rigel hi
23:05 🔗 rigel so i downloaded the warrior image
23:05 🔗 rigel and it creates a second HD that is 60gb in size
23:05 🔗 rigel i dont have nearly that much free space
23:05 🔗 rigel do i need it to be that large?
23:10 🔗 chfoo rigel: i think that's the recommended size to avoid unexpectedly running out of disk space
23:11 🔗 chfoo but typically, the second disk image fills up to about 25GB for me
23:13 🔗 rigel i see
23:13 🔗 rigel well, sorry i couldn't be of help
23:15 🔗 chfoo :(
23:19 🔗 xmc does the warrior have TRIM?
23:20 🔗 xmc virtualbox properly trims vhds, apparently
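
For what it's worth: if the warrior's second disk is a dynamically allocated image (which chfoo's ~25GB observation suggests), the 60GB is a ceiling rather than an upfront cost, and the file can be shrunk again later. A sketch, with a hypothetical .vdi filename:

    # zero out free space inside the guest first, then compact the image
    # so it shrinks back to roughly its real usage
    VBoxManage modifyhd "warrior-disk2.vdi" --compact
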
