[06:21] hi, just learned that FilePlanet was being shut down and archived
[06:22] I used to help run Polycount, which used to have all their 3D models hosted there
[06:23] is the best way to find all the files they lost by going through the metadata of the tars?
[06:23] the wiki says all the data on the page is otherwise outdated
[08:23] good thing i backed stillflying.net up: http://fireflyfans.net/mthread.aspx?bid=2&tid=53804
[09:25] Vito``: #fireplanet :)
[09:25] Vito``: we have not uploaded much yet
[09:25] we will have a nice interface some day
[09:25] but actually not for the polycount stuff (because that is from the older planet* hosting and people put private files up in their spaces)
[09:25] we got ALL the files so we cannot publish that
[09:26] i am trying to host it so that if you know a path, you can download it (no public index)
[09:26] that should prevent privacy issues
[09:30] Vito``: i have the whole planetquake stuff locally on my machine, so if you need a specific file, just shout
[09:30] i thought the models were mirrored by others already though, eg leileilol
[09:51] schbirid1: if I compiled a list of paths, do you have them locally?
[09:53] yeah
[11:44] http://www.familyguyonline.com/ is shutting down Jan. 18. Might be worth grabbing whatever is on the site now, to remember the game.
[11:44] they will probably redirect the domain eventually.
[15:05] Merry Christmas Archivers!!
[15:31] Is there a tutorial somewhere on how to use wget for different sites?
[15:37] hiker1: there are wget examples on the pages of many services on our wiki
[15:38] What do you mean?
[15:38] Can you give me an example?
[15:45] wget -r -nH -np -Amp3 --cut-dirs=1 http://foo.com/~tef/filez
[15:46] makes a directory 'filez' with all the mp3s it found
[15:46] -r - recursive, follow links
[15:46] -nH - don't make a directory for the host (foo.com)
[15:46] -np - don't go to a parent directory
[15:46] --cut-dirs=1 - strip '~tef' from the path
[15:46] -Amp3 - only save mp3s
[15:48] That doesn't use warc output.
[15:50] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output#Usage
[15:50] oh
[15:51] I was using like rewrite urls
[15:51] and some other commands
[15:51] It varies so much by website
[15:54] and to download all the site's prerequisites
[17:54] Is it possible to append to a warc file?
[17:55] or append to a wget mirror?
[17:55] The site I mirrored apparently uses a subdomain, but I used the --no-parent argument.
[17:57] I also used --convert-links, but it did not convert links to the subdomain.
[19:09] hiker2: from what i know, no
[19:09] you can use -c but iirc it does not work too well with -m usually
[19:10] Someone in here mentioned they grabbed all the urls from a site before actually downloading the site. Is this possible? useful?
[19:21] depends on the website
[19:21] you can use --spider
[19:21] BUT that will download, just not store
[19:21] When would that be useful?
[19:21] if you have no space and want to find out about the site structure
[19:22] or if you are just interested in the URLs, not the data
[19:22] It seems that since wget has no way to continue warc downloads, it would be useful to create a program that can.
[19:23] wget doesn't seem particularly well-suited to download complete mirrors of websites.
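
A rough sketch of the kind of invocation the Wget_with_WARC_output wiki page linked above covers, assuming a wget built with WARC support (1.14 or newer); the host name and WARC file name are placeholders, not taken from the log:

    # mirror a site and record the traffic into mysite.warc.gz plus a CDX index
    wget --mirror --page-requisites --no-parent \
         --warc-file=mysite --warc-cdx \
         -e robots=off --wait=1 \
         http://example.com/
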
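The --spider trick mentioned above (fetching pages to discover URLs without storing them) is usually combined with some log filtering, since wget prints the URLs it requests to stderr; the exact grep/awk filter depends on the log format. A minimal sketch, with example.com as a placeholder:

    # recurse in spider mode and keep only the list of URLs wget visited
    wget --spider -r http://example.com/ 2>&1 \
      | grep '^--' | awk '{print $3}' | sort -u > urls.txt
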
[19:32] it could be better for sure
[19:32] also eats memory :(
[19:32] there is heritrix which archive.org uses but i never tried that
[19:32] httrack as well
[19:32] but I don't think it supports WARC
[19:33] i have had awful results with httrack
[19:34] someone wrote http://code.google.com/p/httrack2arc/
[19:34] which converts httrack output to ARC format
[19:34] When I used HTTrack it worked for what I needed.
[19:34] I think it resumes too
[20:04] Too bad it's running on a retarded operating system with a crappy file system that's case insensitive
[20:06] httrack is on linux too
[20:07] Huh, didn't know that
[20:21] MERRY CHRISTMAS ARCHIVE TEAM
[20:21] JESUS SAVES AND SO DO WE
[20:23] \o/
[20:35] http://i.imgur.com/Jek9D.jpg
[21:07] http://www.carolinaherrera.com/212/es/areyouonthelist?share=2zkuHzwOxvy930fvZN7HOVc97XE-GNOL1fzysCqIoynkz4rz3EUUdzs6j6FXsjB4447F-isvxjqkXd4Qey2GHw#teaser
[21:14] http://www.carolinaherrera.com/212/es/areyouonthelist?share=XTv1etZcVd-19S-VT5m1-oIXWSwtlJ3dj4ARKTLVwK7kz4rz3EUUdzs6j6FXsjB4447F-isvxjqkXd4Qey2GHw#episodio-1
[21:24] BUT MY EXPENSIVE UNWANTED THING
[21:25] I like how the first thing to load on that page is a php error
[22:44] heritrix isn't that good :v
[22:59] what, in general?
[23:01] I guess in the context of the earlier conversation, ie for a random-person grab-a-site expedition
[23:10] SketchCow: well, it's a million lines of code, kinda interwoven. it sorta does the job though
[23:10] my impression from picking through it trying to find out the format idiosyncrasies of ARC made me unhappy
[23:11] at work we use something like phantomjs + mitmproxy to dump warcs.
[23:17] don't get me wrong, i haven't had to use it in anger, but wget should perform just as well, considering it likely has very similar crawling logic
[23:20] Is there a way to get wget to download external images?
[23:20] like from tinypic.
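
For the two closing questions (links on a subdomain being skipped, since wget does not span hosts by default, and pulling in images hosted elsewhere such as tinypic), wget's host-spanning options are the usual answer. A sketch with placeholder domains; note that --convert-links only rewrites links to files that were actually downloaded, which is why the subdomain links in the earlier mirror were left untouched:

    # follow links across hosts, but only into the listed domains
    # (-D does suffix matching, so tinypic.com also covers i.tinypic.com etc.)
    wget --mirror --page-requisites --convert-links \
         --span-hosts --domains=example.com,forum.example.com,tinypic.com \
         http://www.example.com/
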