#archiveteam 2013-09-29,Sun


Time	Nickname	Message
07:16 ^🔗	qwebirc29	Hey
07:17 ^🔗	BlueMax	hi
07:17 ^🔗	qwebirc29	Cool there
07:17 ^🔗	qwebirc29	there's someone here
07:17 ^🔗	qwebirc29	How's it going?
07:17 ^🔗	qwebirc29	Any skilled archivists here?
07:18 ^🔗	BlueMax	there's plenty
07:18 ^🔗	qwebirc29	A site I'm on has been taken over, a lot of people have been locked out of their accounts...
07:18 ^🔗	qwebirc29	And the new admin has threatened to wipe their posts numerous times
07:19 ^🔗	joepie93	qwebirc29: what site is this?
07:19 ^🔗	qwebirc29	If anyone has any ideas how to make a backup I'm all ears. Thanks...
07:20 ^🔗	qwebirc29	The site's called www.amkon.net
07:20 ^🔗	qwebirc29	It used to be full of all sorts of renegade research... but now it's in jeopardy
07:20 ^🔗	qwebirc29	I emailed Jason about it a few weeks back and he directed me here
07:21 ^🔗	joepie93	qwebirc29: hmm, I assume that most of it is only accessible to members?
07:21 ^🔗	qwebirc29	There's about a million posts there
07:21 ^🔗	qwebirc29	About half is accessible to members
07:21 ^🔗	joepie93	I see
07:21 ^🔗	qwebirc29	There's actually plenty in the public section. If I could just back up the public side that'd be cool
07:21 ^🔗	qwebirc29	I'm teaching myself how to use Wget.
07:21 ^🔗	joepie93	well, the best method would probably be to wget-warc the site using your own login cookie, however that would mean that your username would be visible on every page
07:22 ^🔗	joepie93	if that's not a problem, then you'll be fine
07:22 ^🔗	qwebirc29	I can do it anonymously, publicly
07:22 ^🔗	qwebirc29	They actually gave me permission to make a back up. But they're very erratic.
07:23 ^🔗	joepie93	well, it probably still won't be anonymous - your IP will end up in their logs, and I'm sure they keep IPs on user accounts
07:23 ^🔗	joepie93	but at least the archive won't have your username on every page
07:23 ^🔗	joepie93	:P
07:23 ^🔗	qwebirc29	Do you think I should set bandwidth limits on Wget?
07:23 ^🔗	joepie93	that depends
07:23 ^🔗	qwebirc29	I don't mind if they trace my IP
07:23 ^🔗	joepie93	how fast do you expect it to disappear?
07:23 ^🔗	qwebirc29	Actually I'm just looking for advice on how to download 10,000 threads fairly, without swamping their bandwidth
07:24 ^🔗	joepie93	(also, unrelated side-note: "tracing IPs" probably doesn't mean what you think it means :D)
07:24 ^🔗	joepie93	well, you could introduce a pause between downloading pages
07:24 ^🔗	joepie93	using --wait
07:24 ^🔗	qwebirc29	I don't think it'll disappear soon. But I don't know
07:24 ^🔗	joepie93	http://www.archiveteam.org/index.php?title=Wget#Forum_Grab
07:24 ^🔗	qwebirc29	Good idea
07:24 ^🔗	joepie93	that's what I used to grab the team17 forums
07:24 ^🔗	joepie93	which also ran vbulletin
07:24 ^🔗	qwebirc29	How long did it take you to grab that?
07:25 ^🔗	qwebirc29	I appreciate the time you're taking to answer these questions BTW
07:25 ^🔗	joepie93	quite a while, can't recall how long
07:25 ^🔗	joepie93	but expect it to take in the order of hours/days if it's thousands of threads
07:25 ^🔗	qwebirc29	And did the admin notice and IP ban you?
07:25 ^🔗	joepie93	depending on how many pages each thread is
07:25 ^🔗	joepie93	and, can't recall, it's been a while
07:25 ^🔗	qwebirc29	Cool that's helpful Joe.
07:26 ^🔗	qwebirc29	Just one more burning question if you have time...
07:26 ^🔗	qwebirc29	Do the threads come down in PHP format?
07:26 ^🔗	qwebirc29	can I mass convert them to MHT later?
07:26 ^🔗	joepie93	if you use the stuff I just linked to, they will be saved as a .warc.gz file
07:26 ^🔗	joepie93	which is a specific archiving format
07:27 ^🔗	Archivist	OK
07:27 ^🔗	joepie93	that retains all data about the HTTP requests and such
07:27 ^🔗	Archivist	I have a bit of experience with Wget
07:27 ^🔗	joepie93	you can then upload it to archive.org (recommended, so the archive will be accessible to others)
07:27 ^🔗	Archivist	And can they be batch converted?
07:27 ^🔗	joepie93	and/or convert it to a zip using this: http://warctozip.archive.org/
07:27 ^🔗	joepie93	batch converted to what?
07:27 ^🔗	Archivist	Oh there are all sorts of legal issues with a public upload
07:27 ^🔗	Archivist	Batch convert to MHT
07:27 ^🔗	joepie93	:P
07:27 ^🔗	joepie93	you really don't want MHT
07:28 ^🔗	joepie93	it's a terrible format
07:28 ^🔗	Archivist	Actually I just want to make an offline browser
07:28 ^🔗	joepie93	if you just extract the ZIP, you should be able to browse the pages on your local machine
07:28 ^🔗	Archivist	MHT seems to retain all the photos and gifs and bells and whistles
07:28 ^🔗	joepie93	anyway, as for the legal issues
07:28 ^🔗	joepie93	just upload it
07:28 ^🔗	joepie93	if archive.org gets complaints, they'll make it inaccessibl
07:28 ^🔗	joepie93	inaccessible *
07:28 ^🔗	joepie93	if they don't, then even better
07:28 ^🔗	joepie93	but to be fair, it's very very hard to raise legal issues against forum archives
07:28 ^🔗	Archivist	I don't really want to screw with them right now. They would end up wiping my account
07:28 ^🔗	joepie93	because there are so many contributors
07:29 ^🔗	Archivist	Well exactly, right...
07:29 ^🔗	Archivist	But they are acting all strange.
07:29 ^🔗	joepie93	hmm..
07:29 ^🔗	joepie93	not sure who might know this
07:29 ^🔗	Archivist	They basically want the power to delete anyone's posts
07:29 ^🔗	joepie93	alard perhaps... does warc.gz retain the IP of the requesting client?
07:29 ^🔗	joepie93	or WARC, rather
07:29 ^🔗	joepie93	Archivist: depending on the answer to that, you could always just send it to me and I can upload it under my archive.org account
07:30 ^🔗	joepie93	and if the IP isn't kept in the WARC it's not tied to you
07:30 ^🔗	joepie93	but I'm not sure whether WARC retains that data or not
07:30 ^🔗	joepie93	<Archivist>MHT seems to retain all the photos and gifs and bells and whistles
07:30 ^🔗	joepie93	also
07:30 ^🔗	joepie93	MHT is just a container format
07:30 ^🔗	Archivist	OK I get it
07:31 ^🔗	joepie93	any archiving tool worth its salt should do that
07:31 ^🔗	joepie93	as does wget-warc, assuming you have page-requisites turned on
07:31 ^🔗	Archivist	OK cool, I'm starting to get it
07:31 ^🔗	joepie93	hmm
07:31 ^🔗	joepie93	moment
07:31 ^🔗	Archivist	page requisites are all the extras, right?
07:32 ^🔗	joepie93	wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
07:32 ^🔗	joepie93	that's what you'll want to do
07:32 ^🔗	joepie93	I think
07:32 ^🔗	joepie93	yes
07:32 ^🔗	joepie93	page-requisites are assets that are linked from the page as images, stylesheets, etc.
07:32 ^🔗	Archivist	Nice, cheers
07:33 ^🔗	Archivist	You guys are doing a valuable service
07:33 ^🔗	joepie93	so yeah, summary:
07:33 ^🔗	joepie93	1. wget -e robots=off --wait 0.25 "http://amkon.net/" --mirror --page-requisites --warc-file="at-amkon"
07:33 ^🔗	Archivist	Cool....
07:33 ^🔗	joepie93	2. if WARC doesn't retain IPs (which someone else should elaborate on), you can send the resulting warc.gz to me and I can upload it to IA for you
07:33 ^🔗	Archivist	Umm. another thing.. please tell me if that's too many questions
07:33 ^🔗	joepie93	3. convert to ZIP using http://warctozip.archive.org/
07:33 ^🔗	joepie93	and you have a local copy
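
Editor's note: the wget invocation in step 1 can be sketched as a small Python helper that assembles the argument list without running anything. The flags are the ones joepie93 gave; `--limit-rate` is a real wget option added here only because bandwidth limits were asked about earlier, and the `200k` value, the function name `build_wget_command`, and its parameters are illustrative assumptions, not something recommended in the channel.

```python
import shlex


def build_wget_command(url, warc_name, wait_seconds=0.25, limit_rate=None):
    """Return the wget argument list from the summary (nothing is executed)."""
    args = [
        "wget",
        "-e", "robots=off",           # ignore robots.txt, as in the log
        "--wait", str(wait_seconds),  # pause between requests
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, stylesheets, etc.
        f"--warc-file={warc_name}",   # write a <warc_name>.warc.gz alongside
    ]
    if limit_rate is not None:
        # Assumption: an optional bandwidth cap, not part of the original command.
        args.append(f"--limit-rate={limit_rate}")
    args.append(url)
    return args


command = build_wget_command("http://amkon.net/", "at-amkon", limit_rate="200k")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(command)` would start the crawl; leaving `limit_rate` as `None` reproduces the command exactly as posted.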
07:33 ^🔗	joepie93	and no, just ask away :P
07:33 ^🔗	Archivist	Can I start one night, then break and do it again another night?
07:34 ^🔗	joepie93	I don't think it's possible (when using warc at least), but someone more familiar with wget may contradict me on that
07:34 ^🔗	joepie93	so I'm not sur
07:34 ^🔗	joepie93	sure *
07:34 ^🔗	BlueMax	is the site so big as to require downloading over one night?
07:35 ^🔗	joepie93	BlueMax: idk, it's a vbulletin forum, those are usually pretty noisy wrt different URLs and URL formats
07:35 ^🔗	joepie93	and using a --wait that may rack up archiving time quickly
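
Editor's note: a rough back-of-the-envelope sketch of how `--wait` adds up. The 10,000 threads and the 0.25 second wait come from the log; the pages per thread and the per-request fetch time are made-up assumptions, and the real crawl would also fetch page requisites and vBulletin's many duplicate URL variants, so treat the result as a lower bound at best.

```python
def estimate_crawl_hours(threads, pages_per_thread, wait_seconds, fetch_seconds):
    """Estimate crawl duration in hours for one sequential wget process."""
    requests = threads * pages_per_thread
    return requests * (wait_seconds + fetch_seconds) / 3600


# Assumptions: 5 pages per thread and 0.5 s per request are illustrative only.
hours = estimate_crawl_hours(
    threads=10_000, pages_per_thread=5, wait_seconds=0.25, fetch_seconds=0.5
)
print(f"roughly {hours:.1f} hours")
```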
07:36 ^🔗	joepie93	and BlueMax, perhaps you can answer that: does wget-warc keep the requesting IP address in the resulting warc.gz?
07:36 ^🔗	joepie93	ie., is it possible to identify the IP of the archivist from the archive
07:36 ^🔗	BlueMax	I honestly have no idea lol I was referring to the amount of time to download the website
07:36 ^🔗	BlueMax	Didn't know keeping IP addresses was important
07:37 ^🔗	joepie93	BlueMax: they're two separate topics
07:37 ^🔗	BlueMax	oh
07:37 ^🔗	joepie93	"it's a vBulletin forum so archiving might take a while because it's inconsistent in URL format and --wait is used"
07:37 ^🔗	joepie93	and
07:37 ^🔗	joepie93	"also, does wget-warc keep requester IPs?"
07:37 ^🔗	joepie93	:P
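
Editor's note: the question about requester IPs goes unanswered in the channel. In the WARC specification, the `WARC-IP-Address` header records the address of the server that was contacted, not the client's. One way to check a particular file is to scan it for that header and look at the values. The sketch below is a crude line scan using only the standard library, with a hypothetical function name; a payload line that happens to start with the same text would be picked up too, and the client's address could still appear inside page content if the site prints it, so this is a sanity check rather than proof.

```python
import gzip
import io


def warc_ip_addresses(fileobj):
    """Collect WARC-IP-Address header values from a gzip-compressed WARC stream."""
    found = set()
    with gzip.open(fileobj, "rb") as stream:
        for line in stream:
            if line.lower().startswith(b"warc-ip-address:"):
                value = line.split(b":", 1)[1].strip()
                found.add(value.decode("ascii"))
    return found


# Synthetic one-record sample; 192.0.2.10 is a documentation-range address.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-IP-Address: 192.0.2.10\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n\r\n\r\n"
)
addresses = warc_ip_addresses(io.BytesIO(gzip.compress(sample)))
print(addresses)
```

For a real file, `warc_ip_addresses(open("at-amkon.warc.gz", "rb"))` would list every address recorded; if the only values are the forum's own server addresses, the header itself is not identifying the archivist.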
07:37 ^🔗	brayden	pshh...
07:37 ^🔗	*	brayden uses urllib
07:38 ^🔗	joepie93	brayden: bah
07:38 ^🔗	joepie93	then at least use requests
07:38 ^🔗	*	joepie93 thinks requests should be in stdlib
07:38 ^🔗	Archivist	I'm reading what you wrote now
07:39 ^🔗	brayden	Yeah.. have to agree with the introduction they have. It is such a pain in the ass to do cookies and user-agents on urllib :(
07:39 ^🔗	brayden	I'm tempted though to use Tornado instead for their async HTTP client
07:39 ^🔗	brayden	http://www.tornadoweb.org/en/stable/httpclient.html
07:40 ^🔗	joepie93	brayden: mmm... async is the only thing I think requests doesn't have
07:40 ^🔗	joepie93	oh, also, if you need to do requests from a specific interface, I have a patch for that
07:40 ^🔗	joepie93	for requests
07:40 ^🔗	brayden	tornado reckons it can do that if you enable use of pycurl
07:41 ^🔗	brayden	never had to though, I don't have more than one :(
07:41 ^🔗	Archivist	Sorry, is Tornado related to the Amkon back up or is it a different topic?
07:41 ^🔗	godane	does anyone do backups of digital planet/click podcast on the bbc?
07:41 ^🔗	brayden	not in the very slightest unless you want an asynchronous HTTP client to backup Amkon?
07:41 ^🔗	joepie93	Archivist: nah, unrelated
07:41 ^🔗	joepie93	brayden: https://gist.github.com/joepie91/5896273
07:41 ^🔗	Archivist	OK cool.
07:41 ^🔗	joepie93	technically it's pick your own IP, not pick your own interface... but hey!
07:42 ^🔗	Archivist	Thanks for answering all those questions, I have a fair idea where to start now
07:42 ^🔗	godane	cause that show deletes episodes after 30 days and the Wayback Machine doesn't even have a good archive for 2013
07:42 ^🔗	joepie93	though I think we're moving into -bs material here
07:42 ^🔗	*	brayden runs away to #archiveteam-bs
07:42 ^🔗	joepie93	Archivist: np, if you have any further questions, feel free to ask
07:45 ^🔗	qwebirc29	Cheers all. Have a good Sunday!
07:46 ^🔗	BlueMax	well wasn't he a nice fellow
08:13 ^🔗	yipdw	oh
08:13 ^🔗	yipdw	I should have let him know that amkon.net is far too big for wget to just grab, unless he has a machine with gobs of RAM
19:19 ^🔗	bsmith093	is urlteam down or something?
20:36 ^🔗	omf_	From twitter http://zapd.com/ is closing in 1 week
20:38 ^🔗	omf_	wget won't work on that site because all the image content is loaded via javascript
21:23 ^🔗	chfoo	i created the wiki page for zapd: http://archiveteam.org/index.php?title=Zapd
22:07 ^🔗	*	robink is having issues mirroring a site with wget
23:05 ^🔗	rigel	hi
23:05 ^🔗	rigel	so i downloaded the warrior image
23:05 ^🔗	rigel	and it creates a second HD that is 60gb in size
23:05 ^🔗	rigel	i dont have nearly that much free space
23:05 ^🔗	rigel	do i need it to be that large?
23:10 ^🔗	chfoo	rigel: i think that's the recommended size to avoid unexpectedly running out of disk space
23:11 ^🔗	chfoo	but typically, the second disk image fills up to about 25gb for me
23:13 ^🔗	rigel	i see
23:13 ^🔗	rigel	well, sorry i couldn't be of help
23:15 ^🔗	chfoo	:(
23:19 ^🔗	xmc	does the warrior have TRIM?
23:20 ^🔗	xmc	virtualbox properly trims vhds, apparently
