Time | Nickname | Message
06:21 | Vito`` | hi, just learned that Fileplanet was being shut down and archived
06:22 | Vito`` | I used to help run Polycount, which used to have all their 3D models hosted there
06:23 | Vito`` | is the best way to find all the files they lost by going through the metadata of the tars?
06:23 | Vito`` | the wiki says all the data on the page is otherwise outdated
08:23 | godane | good thing i backed stillflying.net up: http://fireflyfans.net/mthread.aspx?bid=2&tid=53804
09:25 | schbirid1 | Vito``: #fireplanet :)
09:25 | schbirid1 | Vito``: we have not uploaded much yet
09:25 | schbirid1 | we will have a nice interface some day
09:25 | schbirid1 | but actually not for the polycount stuff (because that is from the older planet* hosting and people put private files up in their spaces)
09:25 | schbirid1 | we got ALL the files so we cannot publish that
09:26 | schbirid1 | i am trying to host it so that if you know a path, you can download it (no public index)
09:26 | schbirid1 | that should prevent privacy issues
09:30 | schbirid1 | Vito``: i have the whole planetquake stuff locally on my machine, so if you need a specific file, just shout
09:30 | schbirid1 | i thought the models were mirrored by others already though, eg leileilol
09:51 | Vito`` | schbirid1: if I compiled a list of paths, you have them locally
09:51 | Vito`` | ?
09:53 | schbirid1 | yeah
11:44 | hiker1 | http://www.familyguyonline.com/ is shutting down Jan. 18. Might be worth grabbing whatever is on the site now, to remember the game.
11:44 | hiker1 | they will probably redirect the domain eventually.
15:05 | no2pencil | Merry Christmas Archivers!!
15:31 | hiker1 | Is there a tutorial somewhere on how to use wget for different sites?
15:37 | Nemo_bis | hiker1: there are wget examples on the pages of many services on our wiki
15:38 | hiker1 | What do you mean?
15:38 | hiker1 | Can you give me an example?
15:45 | tef | wget -r -nH -np -Amp3 --cut-dirs=1 http://foo.com/~tef/filez
15:46 | tef | makes a directory 'filez' with all the mp3s it found
15:46 | tef | -r - recursive, follow links
15:46 | tef | -nH - don't make a directory for the host (foo.com)
15:46 | tef | -np - don't go to a parent directory
15:46 | tef | --cut-dirs=1 - strip '~tef' from the path
15:46 | tef | -Amp3 - only save mp3s
15:48 | hiker1 | That doesn't use warc output.
15:50 | Deewiant | http://www.archiveteam.org/index.php?title=Wget_with_WARC_output#Usage
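For reference, a minimal sketch of the sort of invocation that wiki page covers, with example.com and the WARC name standing in as placeholders (the page itself has the currently recommended flags):

    # mirror a site and also record every request/response into example-site.warc.gz
    wget --mirror --page-requisites -e robots=off --warc-file=example-site "http://example.com/"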
15:50 | tef | oh
15:51 | hiker1 | I was using things like URL rewriting
15:51 | hiker1 | and some other options
15:51 | hiker1 | It varies so much by website
15:54 | hiker1 | and to download all the site's prerequisites
17:54 | hiker1 | Is it possible to append to a warc file?
17:55 | hiker1 | or append to a wget mirror?
17:55 | hiker1 | The site I mirrored apparently uses a subdomain, but I used the --no-parent argument.
17:57 | hiker1 | I also used --convert-links, but it did not convert links to the subdomain.
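A possible way around the subdomain problem, sketched with placeholder hostnames: a recursive wget stays on the starting host by default, and --convert-links only rewrites links to files that were actually fetched, so the subdomain has to be pulled in explicitly with --span-hosts and --domains:

    # also follow links onto the listed hosts (placeholders) so the subdomain gets mirrored
    wget --mirror --page-requisites --convert-links --span-hosts --domains=example.com,media.example.com "http://example.com/"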
19:09 | schbirid1 | hiker2: from what i know, no
19:09 | schbirid1 | you can use -c but iirc it does not work too well with -m usually
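In other words there is no appending to an existing WARC: a rerun with -c (--continue) and -m (--mirror) can pick up files already on disk, but any WARC output goes into a fresh file, along the lines of the sketch below (names are placeholders, and as noted above the combination is not reliable):

    # resume a mirror already on disk, writing newly fetched records to a new WARC
    wget -c -m --warc-file=example-site-part2 "http://example.com/"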
19:10 | hiker2 | Someone in here mentioned they grabbed all the urls from a site before actually downloading the site. Is this possible? useful?
19:21 | schbirid1 | depends on the website
19:21 | schbirid1 | you can use --spider
19:21 | schbirid1 | BUT that will download, just not store
19:21 | hiker2 | When would that be useful?
19:21 | schbirid1 | if you have no space and want to find out about the site structure
19:22 | schbirid1 | or if you are just interested in the URLs, not the data
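A rough sketch of that URL-gathering approach, with a placeholder URL and an arbitrary recursion depth; --spider makes wget traverse without keeping pages, -o writes the log, and the visited URLs can then be pulled back out of that log:

    # walk the site without saving pages, then extract the URLs wget reported
    wget --spider -r -l 3 -o spider.log "http://example.com/"
    grep -oE 'https?://[^ ]+' spider.log | sort -u > urls.txt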
19:22 | hiker2 | It seems that since wget has no way to continue warc downloads, it would be useful to create a program that does.
19:22 | hiker2 | *can
19:23 | hiker2 | wget doesn't seem particularly well-suited to download complete mirrors of websites.
19:32 | schbirid1 | it could be better for sure
19:32 | schbirid1 | also eats memory :(
19:32 | schbirid1 | there is heritrix which archive.org uses but i never tried that
19:32 | hiker2 | httrack as well
19:33 | hiker2 | but I don't think it supports WARC
19:34 | schbirid1 | i have had awful results with httrack
19:34 | hiker2 | someone wrote http://code.google.com/p/httrack2arc/
19:34 | hiker2 | which converts HTTrack output to ARC format
19:34 | hiker2 | When I used HTTrack it worked for what I needed.
20:04 | hiker2 | I think it resumes too
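For comparison, a basic HTTrack run of the kind being discussed, with a placeholder URL, output directory and filter; its resume behaviour and exact options are best checked against the HTTrack docs rather than taken from this sketch:

    # mirror example.com into ./example-mirror, restricted to that domain
    httrack "http://example.com/" -O ./example-mirror "+*.example.com/*"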
20:04 | ersi | Too bad it's running on a retarded operating system with a crappy file system that's case insensitive
20:06 | schbirid1 | httrack is on linux too
20:07 | ersi | Huh, didn't know that
20:21 | SketchCow | MERRY CHRISTMAS ARCHIVE TEAM
20:21 | SketchCow | JESUS SAVES AND SO DO WE
20:23 | SmileyG | \o/
20:35 | ersi | http://i.imgur.com/Jek9D.jpg
21:07 | rubita | http://www.carolinaherrera.com/212/es/areyouonthelist?share=2zkuHzwOxvy930fvZN7HOVc97XE-GNOL1fzysCqIoynkz4rz3EUUdzs6j6FXsjB4447F-isvxjqkXd4Qey2GHw#teaser
21:14 | rubita | http://www.carolinaherrera.com/212/es/areyouonthelist?share=XTv1etZcVd-19S-VT5m1-oIXWSwtlJ3dj4ARKTLVwK7kz4rz3EUUdzs6j6FXsjB4447F-isvxjqkXd4Qey2GHw#episodio-1
21:24 | SketchCow | BUT MY EXPENSIVE UNWANTED THING
21:25 | chronomex | I like how the first thing to load on that page is a php error
22:44 | tef | heritrix isn't that good :v
22:59 | SketchCow | what, in general?
23:01 | ersi | I guess in the context of the earlier conversation, ie for a random-person-grab-site-expedition
23:10 | tef | SketchCow: well, it's a million lines of code, kinda interwoven. it sorta does the job though
23:10 | tef | my impression from picking through it trying to find out the format idiosyncrasies of ARC made me unhappy
23:11 | tef | at work we use something like phantomjs + mitmproxy to dump warcs.
23:17 | tef | don't get me wrong, i haven't had to use it in anger, but wget should perform just as well, considering it likely has very similar crawling logic
23:20 | hiker2 | Is there a way to get wget to download external images?
23:20 | hiker2 | like from tinypic.
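One hedged way to attempt that: --page-requisites pulls in inline images, and --span-hosts with --domains lets wget leave the starting host for the listed ones (the hostnames here are placeholders for wherever the images actually live):

    # fetch the site plus inline images hosted on the listed external domains (placeholders)
    wget --mirror --page-requisites --span-hosts --domains=example.com,tinypic.com "http://example.com/"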