#archiveteam 2012-08-18,Sat

Time Nickname Message
10:00 Nemo_bis SketchCow: can you delete http://archive.org/details/wikitravel.org please?
11:58 emijrp exporting metadata for +1,000,000 books on IA
11:59 emijrp surpass 1gb
12:08 Nemo_bis cool
12:08 Nemo_bis emijrp: now re-archiving wikitravel.org
12:09 Nemo_bis emijrp: because amazingly https://meta.wikimedia.org/wiki/Travel_Guide complained that they don't provide dumps, when it's so easy to make one!
12:09 emijrp current only?
12:09 Nemo_bis surely that's not a reason to change hosting
12:09 Nemo_bis no, all history; that was only a bug...
12:09 emijrp i mean, wikitravel servers are weak
12:10 emijrp and full history crashed in the past
12:10 Nemo_bis crashed how?
12:10 Nemo_bis I'm getting lots of httplib.IncompleteRead: IncompleteRead(359 bytes read, 1678 more expected)
12:10 Nemo_bis but you only have to restart
12:11 Nemo_bis IIRC
12:11 emijrp crashed or slow as hell and i get tired
12:11 Nemo_bis hehe
12:25 emijrp Results: 1 through 50 of 4,929,084 (49.686 secs)
12:26 emijrp i remember some IA graphs and stats... was there one with item numbers?
12:58 Nemo_bis emijrp: in fact, I had to remove a page from the script's list and download it with Special:Export myself (only 27 MiB history, weird)
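
Nemo_bis doesn't show the exact command he used, so as a rough illustration only: a minimal Python 2 sketch (Python 2 to match the httplib traceback above) that grabs one page's full revision history through MediaWiki's Special:Export and simply restarts the request whenever the read is cut short. The export URL and page title are assumptions.

    import httplib
    import time
    import urllib
    import urllib2

    # Assumed export endpoint for the English Wikitravel wiki; adjust to the
    # wiki's real script path. The page title is a hypothetical example.
    EXPORT_URL = 'http://wikitravel.org/wiki/en/Special:Export'
    PAGE = 'Main_Page'

    def export_history(title):
        # Omitting 'curonly' asks Special:Export for the full revision history
        # (where the wiki allows it).
        data = urllib.urlencode({'pages': title, 'action': 'submit'})
        while True:
            try:
                return urllib2.urlopen(EXPORT_URL, data).read()
            except (httplib.IncompleteRead, urllib2.URLError) as error:
                # Weak server: wait a bit and simply restart the request.
                print 'error (%s), retrying...' % error
                time.sleep(30)

    open(PAGE + '.xml', 'w').write(export_history(PAGE))

The WikiTeam dumpgenerator.py script does essentially this for every page of a wiki and is presumably the script whose page list Nemo_bis mentions above.
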
14:21 emijrp 7zipping items metadata
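
emijrp doesn't say how the metadata export is being done; the "Results: 1 through 50 of 4,929,084" line looks like output from archive.org's search, so here is a minimal sketch of collecting item metadata in bulk through IA's advancedsearch.php JSON interface. The query, field list and paging are illustrative assumptions, not necessarily what emijrp ran.

    import json
    import time
    import urllib
    import urllib2

    def search_page(query, page, rows=100):
        # One page of archive.org advanced-search results as parsed JSON.
        params = urllib.urlencode([
            ('q', query),
            ('fl[]', 'identifier'),
            ('fl[]', 'title'),
            ('rows', rows),
            ('page', page),
            ('output', 'json'),
        ])
        return json.load(urllib2.urlopen(
            'https://archive.org/advancedsearch.php?' + params))

    docs = []
    page = 1
    while True:
        response = search_page('mediatype:texts', page)['response']
        if not response['docs']:
            break
        docs.extend(response['docs'])
        page += 1
        time.sleep(1)  # be gentle with the search backend
    print 'collected metadata for %d items' % len(docs)

A real export of 1,000,000+ books would need a great many pages and patience, which fits the multi-gigabyte result emijrp describes.
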
15:43 balrog_ is there any tool available for archiving mailing list archives?
15:43 balrog_ or is the best way just with wget?
17:15 Nemo_bis balrog_: what mailing lists?
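
balrog_ never says which lists these are. For the common Mailman/pipermail layout, where each month of a list is also exposed as a gzipped mbox, a minimal sketch like the one below grabs the raw text without spidering the per-message HTML; the list URL is hypothetical, and for other archive software wget mirroring of the HTML pages is the fallback.

    import re
    import urllib2
    import urlparse

    # Hypothetical list URL; pipermail archive index pages link each month's
    # gzipped mbox (e.g. "2012-August.txt.gz").
    INDEX = 'http://lists.example.org/pipermail/some-list/'

    html = urllib2.urlopen(INDEX).read()
    for name in re.findall(r'href="([^"]+\.txt(?:\.gz)?)"', html):
        url = urlparse.urljoin(INDEX, name)
        print 'fetching', url
        with open(name.split('/')[-1], 'wb') as out:
            out.write(urllib2.urlopen(url).read())
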
18:29 emijrp WikiLeaks published 2 insurance files
18:29 emijrp 1GB one is on IA, 64GB one too?
18:29 emijrp http://thepiratebay.se/torrent/5723136/WikiLeaks_insurance
18:29 emijrp http://thepiratebay.se/torrent/7050943/WikiLeaks_Insurance_release_02-22-2012
18:52 Nemo_bis they don't seem to have worked very well though
19:02 winr4r hi hi
19:03 winr4r http://scobleizer.com/2012/08/18/cinchcast-shuts-down-demonstrates-troubles-of-when-you-bet-on-services-you-dont-control/
19:04 chronomex This sounds like another twaud.io
19:39 emijrp Nemo_bis: what do you mean?
20:14 unwhan hello! :D
20:15 Schbirid hi
20:17 unwhan :)
20:17 * unwhan is taking time to describe the case
20:18 unwhan how do i even begin
20:18 unwhan death alert ... death alert ... a website is dying
20:18 unwhan parts of it are still recoverable via Google Cache
20:19 unwhan but it is beyond my capacities to assemble it all from those bits
20:20 ersi I'd start by saying what site it is, i.e. what URL it's currently accessible at
20:20 ersi also, whether it's mostly private or public material
20:20 unwhan it is a controversial website on the border of legal... but the admins have taken care to keep it legal
20:20 unwhan gimme sec
20:21 unwhan so that's the original URL: http://pgforum.freepowerboards.com/
20:21 unwhan meaning: Power Girl Forum
20:22 unwhan dedicated to LEGAL discussion about physically strong / athletic UNDERAGE girls
20:23 unwhan meagre parts of it are available from the WebArchive
20:23 unwhan much much more is available via Google Cache (so far!)
20:23 unwhan but we all know that Google Cache won't last
20:24 unwhan the good news is that all images have been hosted externally on ImageVenue
20:24 unwhan so the crucial part is restoring the HTML layer alone
20:26 unwhan admittedly it isn't a website for a common audience, but it nonetheless contained rare, sought-after and often censored (lawful) material
20:27 ersi ouch, forum + only available cached at google
20:28 unwhan i saved some 50 pages manually which is a far cry from archiving all of it
20:29 SmileyG site:http://pgforum.freepowerboards.com/ well it's indexed, does someone have a script for pulling all the urls from google and pulling the caches?
20:30 SmileyG Page 2 of about 5,990 results (0.16 seconds) #
20:30 unwhan (an added difficulty for me (personally) is that Google periodically bans me if I fetch too much)
20:32 unwhan even Google Cache is not complete, though I would say: fairly complete and perhaps the only good source (while it lasts)
20:34 unwhan in future, ideally, i will be better equipped for emergency archiving missions like this
20:34 ersi wget -m goes a long way, even if it gets trapped in a spider trap
20:35 unwhan :)
20:40 unwhan hmm. not all images were hosted externally. also some messages in older cache copies have been hidden from Google bot. but the total material is fairly well preserved.
20:44 unwhan for comparison, WebArchive has only 91 pages
20:46 ersi I think that site's out of luck, unfortunately
20:46 SmileyG how to limit wget to only webcache.googleusercontent.com ?
20:49 ersi Hmm, I think -A or -R (i.e. recursive accept/reject lists)
20:49 Schbirid that is some creepy shit
20:49 unwhan -D for --domains=
20:50 SmileyG wget -rH -Dwebcache.googleusercontent.com
20:50 SmileyG is what I was trying...
20:50 unwhan Schbirid: as I said, not for a common audience. but lawful.
20:51 SmileyG -A and -R reference types of files...
20:52 SmileyG this stuff needs to be .... done
20:52 SmileyG basically I see us as needing a toolkit that, with little modification, can be pointed at any url and fired off;
20:52 unwhan -D for "done" ;)
20:53 Schbirid SmileyG: man wget
20:53 SmileyG Schbirid: yeah errr I have.
20:54 SmileyG hence why i said -A and -R reference files :s
20:54 SmileyG that's what it says.
20:54 SmileyG problem is -Dwebcache.googleusercontent.com is instantly excluding the crawling (not mirroring) of the google search pages?
20:55 unwhan Google doesn't block me instantly but often enough to make a bulk download unreliable
20:55 Schbirid SmileyG: you are missing a space there
20:56 SmileyG --exclude-domains *.google.com *.google.co.uk doesn't work either...
20:56 unwhan Schbirid: space is unnecessary in wget options, I believe
20:56 SmileyG Schbirid: erm, looking at all the examples on the net there is no space...
20:56 Schbirid weird
20:56 SmileyG yup
20:56 Schbirid well, the man page explains the format iirc
20:56 SmileyG and both ways get the same problem - stops on the index.
20:57 Schbirid you did not even post your line, my crystal ball does not pick up any signals
20:57 SmileyG [21:54:37] < SmileyG> problem is -Dwebcache.googleusercontent.com
20:57 SmileyG no?
20:58 SmileyG or [21:50:23] < SmileyG> wget -rH -Dwebcache.googleusercontent.com
20:58 SmileyG ?
20:58 Schbirid you did not specify any url, only options
20:59 SmileyG wget -Dwebcache.googleusercontent.com -m https://www.google.co.uk/#q=site:http://pgforum.freepowerboards.com
20:59 SmileyG should crawl happily, and only download from the webcache from what I understand...
20:59 Schbirid wget knows no javascript
20:59 SmileyG except instead it doesn't actually crawl the google search page, it just instantly ignores it.
21:00 unwhan :/
21:00 SmileyG ah, it's using the javascript?
21:00 SmileyG Ok, so is there an easy way to fix that?
21:01 unwhan for one, i can easily save the Google index. it is only 7 pages of about 100 results each. even though it says "5,920 results" :(.
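
Since wget can't follow the JavaScript-driven results pages, the workable plan that emerges here is: save the handful of Google index pages by hand (as unwhan describes), pull the forum URLs out of them, then request each cached copy directly from webcache.googleusercontent.com with long pauses to avoid the blocking mentioned above. A minimal sketch of that, with the saved-page filenames and the delay as arbitrary assumptions:

    import glob
    import re
    import time
    import urllib
    import urllib2

    # Google Cache front-end; "serp*.html" are the manually saved result pages.
    CACHE = 'http://webcache.googleusercontent.com/search?q=cache:'

    urls = set()
    for page in glob.glob('serp*.html'):
        urls.update(re.findall(r'http://pgforum\.freepowerboards\.com[^"&<>\s]*',
                               open(page).read()))

    for i, url in enumerate(sorted(urls)):
        # A browser-ish User-Agent; Google tends to reject the default one.
        request = urllib2.Request(CACHE + urllib.quote(url, safe=':/?&='),
                                  headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.HTTPError as error:
            print 'skipping %s (%s)' % (url, error)
            continue
        with open('cache_%05d.html' % i, 'w') as out:
            out.write(html)
        time.sleep(30)  # go slowly; Google blocks aggressive cache scraping

Warrick, linked just below, automates roughly this kind of recovery from search-engine caches and the Wayback Machine.
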
21:02 alard http://warrick.cs.odu.edu/ ?
21:03 unwhan maybe it doesn't display them all... indeed, it seems as if Google had more cached copies than what it shows in the index.
21:04 unwhan alard: nice link
21:04 alard I have no idea if it actually works, but that tool has been discussed here before.
21:05 unwhan "Warrick will not be accepting new jobs at the moment! We're sorry for the inconvenience. Warrick has been overloaded with jobs."
21:05 unwhan ...but "consider downloading the command line version of Warrick for Unix and Linux".
21:08 alard Yeah, you can run it yourself. (But do check the warnings about Google's blocking.)
21:13 alard cinch.fm looks quite horrible on the inside. ASP.NET viewstate, yuck.
21:15 unwhan it's good to know about Warrick, but it's not good for me right now since I don't have access to any Linux machine. if I had, I could probably complete the mission on my own. of course eventually I will have a Linux machine.
21:21 unwhan I remember that many years back Google offered accounts to people who needed to crawl its search results programmatically. If this is still the case, an account like this could be used for large-scale Google Cache scraping without conflict.
21:27 unwhan anyway, i can't do anything more today. i need sleep now. i'm glad that i have made it to this channel. i'll be back tomorrow but I'll stay logged in.
21:40 unwhan Before I go... one line of note about the Power Girl Forum: "thx a lot this forum is the best forum in the world :-)" – logge1968 (11 July 2011)
23:22 underscor whenever emijrp comes back, tell him 10939375 items
