[10:00] SketchCow: can you delete http://archive.org/details/wikitravel.org please?
[11:58] exporting metadata for +1,000,000 books on IA
[11:59] surpass 1gb
[12:08] cool
[12:08] emijrp: now re-archiving wikitravel.org
[12:09] emijrp: because amazingly https://meta.wikimedia.org/wiki/Travel_Guide complained that they don't provide dumps, when it's so easy to make one!
[12:09] current only?
[12:09] surely that's not a reason to change hosting
[12:09] no, all history; that was only a bug...
[12:09] i mean, wikitravel servers are weak
[12:10] and full history crashed in the past
[12:10] crashed how?
[12:10] I'm getting lots of httplib.IncompleteRead: IncompleteRead(359 bytes read, 1678 more expected)
[12:10] but you only have to restart
[12:11] IIRC
[12:11] crashed, or slow as hell and i got tired
[12:11] hehe
[12:25] Results: 1 through 50 of 4,929,084 (49.686 secs)
[12:26] i remember some IA graphs and stats... was there one with item numbers?
[12:58] emijrp: in fact, I had to remove a page from the script's list and download it with special:export myself (only 27 MiB history, weird)
[14:21] 7zipping items metadata
[15:43] is there any tool available for archiving mailing list archives?
[15:43] or is the best way just with wget?
[17:15] balrog_: what mailing lists?
[18:29] WikiLeaks published 2 insurance files
[18:29] 1GB one is on IA, 64GB one too?
[18:29] http://thepiratebay.se/torrent/5723136/WikiLeaks_insurance
[18:29] http://thepiratebay.se/torrent/7050943/WikiLeaks_Insurance_release_02-22-2012
[18:52] they don't seem to have worked very well though
[19:02] hi hi
[19:03] http://scobleizer.com/2012/08/18/cinchcast-shuts-down-demonstrates-troubles-of-when-you-bet-on-services-you-dont-control/
[19:04] This sounds like another twaud.io
[19:39] Nemo_bis: what do you mean?
[20:14] hello! :D
[20:15] hi
[20:17] :)
[20:17] * unwhan is taking time to describe the case
[20:18] how do i even begin
[20:18] death alert ... death alert ... a website is dying
[20:18] parts of it are still recoverable via Google Cache
[20:19] but it is beyond my capacities to assemble it all from those bits
[20:20] I'd start by saying what site it is, i.e. at what URL it's currently accessible
[20:20] also, if it's mostly private or public material
[20:20] it is a controversial website on the border of legal... but the admins have taken care to keep it legal
[20:20] gimme sec
[20:21] so that's the original URL: http://pgforum.freepowerboards.com/
[20:21] meaning: Power Girl Forum
[20:22] dedicated to LEGAL discussion about physically strong / athletic UNDERAGE girls
[20:23] meagre parts of it are available from the WebArchive
[20:23] much much more is available via Google Cache (so far!)
[20:23] but we all know that Google Cache won't last
[20:24] the good news is that all images have been hosted externally on ImageVenue
[20:24] so the crucial part is restoring the HTML layer alone
[20:26] admittedly it isn't a website for a common audience, but it nonetheless contained rare, sought-after and often censored (lawful) material
[20:27] ouch, forum + only available cached at google
[20:28] i saved some 50 pages manually, which is a far cry from archiving it all
[20:29] site:http://pgforum.freepowerboards.com/ well, it's indexed; does someone have a script for pulling all the urls from google and pulling the caches?
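
A minimal sketch of the kind of script being asked for here, in Python: pull the result URLs for the site: query from Google's plain-HTML results pages (sidestepping the JavaScript-only interface that trips up wget further down the log), then fetch each page's cache copy. The /search?q=...&start=N endpoint, the href-extraction regex and the webcache.googleusercontent.com/search?q=cache:<url> form are assumptions about Google's markup at the time, and the long sleeps are there because, as noted below, Google bans bulk fetching.

    # Sketch: collect result URLs for a site: query from Google's plain-HTML
    # results pages, then save each page's Google Cache copy to disk.
    import re
    import time
    import urllib.parse
    import urllib.request

    SITE = "pgforum.freepowerboards.com"
    HEADERS = {"User-Agent": "Mozilla/5.0 (archiving sketch)"}

    def fetch(url):
        req = urllib.request.Request(url, headers=HEADERS)
        return urllib.request.urlopen(req).read().decode("utf-8", "replace")

    def result_urls(site, pages=10, per_page=100):
        """Yield unique result URLs from the non-JavaScript results pages."""
        seen = set()
        for page in range(pages):
            query = urllib.parse.urlencode(
                {"q": "site:" + site, "num": per_page, "start": page * per_page})
            html = fetch("https://www.google.com/search?" + query)
            for url in re.findall(r'href="(http://%s[^"]*)"' % re.escape(site), html):
                if url not in seen:
                    seen.add(url)
                    yield url
            time.sleep(30)  # go slowly; Google blocks bulk fetching

    def save_cache_copy(url, n):
        """Fetch the cached copy of one URL and write it to a numbered file."""
        cache_url = ("https://webcache.googleusercontent.com/search?q=cache:"
                     + urllib.parse.quote(url, safe=""))
        with open("cache_%05d.html" % n, "w", encoding="utf-8") as f:
            f.write(fetch(cache_url))

    if __name__ == "__main__":
        for n, url in enumerate(result_urls(SITE)):
            print(url)
            save_cache_copy(url, n)
            time.sleep(30)

The saved files are raw cache HTML, so they would still carry Google's cache banner on top of the forum markup; stripping that is a separate clean-up step.
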
[20:30] Page 2 of about 5,990 results (0.16 seconds)
[20:30] (an added difficulty for me (personally) is that Google periodically bans me if I fetch too much)
[20:32] even Google Cache is not complete, though I would say: fairly complete, and perhaps the only good source (while it lasts)
[20:34] in future, ideally, i will be better equipped for emergency archiving missions like this
[20:34] wget -m goes a long way, even if it gets trapped in a spider trap
[20:35] :)
[20:40] hmm. not all images were hosted externally. also, some messages in older cache copies have been hidden from the Google bot. but the total material is fairly well preserved.
[20:44] for comparison, WebArchive has only 91 pages
[20:46] I think that site's out of luck, unfortunately
[20:46] how to limit wget to only webcache.googleusercontent.com ?
[20:49] Hmm, I think -A or -R (i.e. recursive accept/reject lists)
[20:49] that is some creepy shit
[20:49] -D for --domains=
[20:50] wget -rH -Dwebcache.googleusercontent.com
[20:50] is what I was trying...
[20:50] Schbirid: as I said, not for a common audience. but lawful.
[20:51] -A and -R reference types of files...
[20:52] this stuff needs to be .... done
[20:52] basically I see us as needing a toolkit that, with little modification, can be pointed at any url and fired off;
[20:52] -D for "done" ;)
[20:53] SmileyG: man wget
[20:53] Schbirid: yeah errr I have.
[20:54] hence why i said -A and -R reference files :s
[20:54] that's what it says.
[20:54] problem is -Dwebcache.googleusercontent.com is instantly excluding the crawling (not mirroring) of the google search pages?
[20:55] Google doesn't block me instantly, but often enough to make a bulk download unreliable
[20:55] SmileyG: you are missing a space there
[20:56] --exclude-domains *.google.com *.google.co.uk doesn't work either...
[20:56] Schbirid: space is unnecessary in wget options, I believe
[20:56] Schbirid: erm, looking at all the examples on the net there is no space...
[20:56] weird
[20:56] yup
[20:56] well, the man page explains the format iirc
[20:56] and both ways get the same problem: it stops on the index.
[20:57] you did not even post your line, my crystal ball does not pick up any signals
[20:57] [21:54:37] < SmileyG> problem is -Dwebcache.googleusercontent.com
[20:57] no?
[20:58] or [21:50:23] < SmileyG> wget -rH -Dwebcache.googleusercontent.com
[20:58] ?
[20:58] you did not specify any url, only options
[20:59] wget -Dwebcache.googleusercontent.com -m https://www.google.co.uk/#q=site:http://pgforum.freepowerboards.com
[20:59] should crawl happily, and only download from the webcache, from what I understand...
[20:59] wget knows no javascript
[20:59] except instead it doesn't actually crawl the google search page, it just instantly ignores it.
[21:00] :/
[21:00] ah, it's using the javascript?
[21:00] Ok, so is there an easy way to fix that?
[21:01] for one, i can easily save the Google index. it is only 7 pages of about 100 results each. even though it says "5,920 results" :(.
[21:02] http://warrick.cs.odu.edu/ ?
[21:03] maybe it doesn't display them all... indeed, it seems as if Google had more cached copies than what it shows in the index.
[21:04] alard: nice link
[21:04] I have no idea if it actually works, but that tool has been discussed here before.
[21:05] "Warrick will not be accepting new jobs at the moment! We're sorry for the inconvenience. Warrick has been overloaded with jobs."
[21:05] ...but "consider downloading the command line version of Warrick for Unix and Linux".
[21:08] Yeah, you can run it yourself. (But do check the warnings about Google's blocking.)
[21:13] cinch.fm looks quite horrible on the inside. ASP.NET viewstate, yuck.
[21:15] it's good to know about Warrick but it's not good for me right now since I don't have access to any Linux machine. if I had, I could probably complete the mission on my own. of course eventually I will have a Linux machine.
[21:21] I remember from many years back that Google offered accounts to people who needed to crawl its search results programmatically. If this is still the case, an account like that could be used for large-scale Google Cache scraping without conflict.
[21:27] anyway, i can't do anything more today. i need sleep now. i'm glad that i have made it to this channel. i'll be back tomorrow but I'll stay logged in.
[21:40] Before I go... one line of note about the Power Girl Forum: "thx a lot this forum is the best forum in the world :-)" – logge1968 (11 July 2011)
[23:22] whenever emijrp comes back, tell him 10939375 items
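
Going back to the wikitravel full-history export from the top of the log (the httplib.IncompleteRead errors where "you only have to restart"): a minimal sketch of that restart loop, assuming a standard MediaWiki Special:Export endpoint. The export URL, the page title and the form field names are placeholders and assumptions, not taken from the actual script, so verify them against the target wiki's Special:Export form.

    # Sketch: request one page's full history via Special:Export and
    # restart the request whenever the weak server cuts the response short.
    import time
    import urllib.parse
    import urllib.request
    from http.client import IncompleteRead  # httplib.IncompleteRead in Python 2

    EXPORT_URL = "http://wikitravel.org/wiki/en/Special:Export"  # assumed path
    TITLE = "Main_Page"  # placeholder page title

    def export_page(title, retries=10):
        data = urllib.parse.urlencode({
            "title": "Special:Export",
            "pages": title,
            "action": "submit",
            "history": "1",    # full history rather than the current revision only
            "templates": "1",
        }).encode("ascii")
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(EXPORT_URL, data=data, timeout=120) as r:
                    return r.read()  # the XML export for this page
            except IncompleteRead as e:
                # the connection was dropped part-way; wait and restart
                print("IncompleteRead (%d bytes read), retrying..." % len(e.partial))
                time.sleep(30 * (attempt + 1))
        raise RuntimeError("gave up on " + title)

    if __name__ == "__main__":
        xml = export_page(TITLE)
        with open(TITLE.replace("/", "_") + ".xml", "wb") as f:
            f.write(xml)

Backing off a little longer on each retry keeps the pressure down on servers that are, as the log says, already weak.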