Time | Nickname | Message
10:00 | Nemo_bis | SketchCow: can you delete http://archive.org/details/wikitravel.org please?
11:58 | emijrp | exporting metadata for +1,000,000 books on IA
11:59 | emijrp | surpass 1gb
12:08 | Nemo_bis | cool
12:08 | Nemo_bis | emijrp: now re-archiving wikitravel.org
12:09 | Nemo_bis | emijrp: because amazingly https://meta.wikimedia.org/wiki/Travel_Guide complained that they don't provide dumps, when it's so easy to make one!
12:09 | emijrp | current only?
12:09 | Nemo_bis | surely that's not a reason to change hosting
12:09 | Nemo_bis | no, all history; that was only a bug...
12:09 | emijrp | i mean, wikitravel servers are weak
12:10 | emijrp | and full history crashed in the past
12:10 | Nemo_bis | crashed how?
12:10 | Nemo_bis | I'm getting lots of httplib.IncompleteRead: IncompleteRead(359 bytes read, 1678 more expected)
12:10 | Nemo_bis | but you only have to restart
12:11 | Nemo_bis | IIRC
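
The restart-until-it-finishes routine Nemo_bis describes can be wrapped in a small loop. A minimal sketch, assuming the dump is produced by some command that exits non-zero when httplib.IncompleteRead kills it; the script name is a hypothetical placeholder, not the actual tool used here:

    # Keep restarting the (placeholder) dump command until it exits cleanly.
    until ./make_wikitravel_dump.sh; do
        echo "dump interrupted, retrying in 60 seconds..." >&2
        sleep 60
    done
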
12:11 | emijrp | crashed or slow as hell and i get tired
12:11 | Nemo_bis | hehe
12:25 | emijrp | Results: 1 through 50 of 4,929,084 (49.686 secs)
12:26 | emijrp | i remember some IA graphs and stats... was there one with item numbers?
12:58 | Nemo_bis | emijrp: in fact, I had to remove a page from the script's list and download it with special:export myself (only 27 MiB history, weird)
14:21 | emijrp | 7zipping items metadata
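
emijrp doesn't say how he pulls the metadata; one way to sketch a bulk export like his against the public archive.org search and metadata endpoints is shown below, with the query, file names, and delay being illustrative assumptions rather than his actual workflow:

    # Grab identifiers from one page of search results, fetch each item's
    # metadata record, then pack everything with 7-Zip.
    mkdir -p metadata
    curl -s "https://archive.org/advancedsearch.php?q=mediatype:texts&fl[]=identifier&rows=1000&page=1&output=json" \
      | grep -oE '"identifier": *"[^"]+"' | sed -E 's/.*"([^"]+)"$/\1/' > identifiers.txt
    while read -r id; do
        curl -s "https://archive.org/metadata/$id" > "metadata/$id.json"
        sleep 1   # stay gentle with the API
    done < identifiers.txt
    7z a ia-metadata.7z metadata/

Stepping the page parameter upwards would cover the rest of the million-plus items.
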
15:43 | balrog_ | is there any tool available for archiving mailing list archives?
15:43 | balrog_ | or is the best way just with wget?
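
For the common case of a Mailman-style pipermail archive (an assumption; balrog_ never names the lists), plain wget usually does the job:

    # Mirror a pipermail tree without climbing above it; the URL is a placeholder.
    wget -m -np -E -k -w 1 --random-wait "http://lists.example.org/pipermail/some-list/"

-np keeps wget below the archive directory, -E and -k make the copy browsable offline, and the waits keep the list server happy.
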
17:15 | Nemo_bis | balrog_: what mailing lists?
18:29 | emijrp | WikiLeaks published 2 insurance files
18:29 | emijrp | 1GB one is on IA, 64GB one too?
18:29 | emijrp | http://thepiratebay.se/torrent/5723136/WikiLeaks_insurance
18:29 | emijrp | http://thepiratebay.se/torrent/7050943/WikiLeaks_Insurance_release_02-22-2012
18:52 | Nemo_bis | they don't seem to have worked very well though
19:02 | winr4r | hi hi
19:03 | winr4r | http://scobleizer.com/2012/08/18/cinchcast-shuts-down-demonstrates-troubles-of-when-you-bet-on-services-you-dont-control/
19:04 | chronomex | This sounds like another twaud.io
19:39 | emijrp | Nemo_bis: what do you mean?
20:14 | unwhan | hello! :D
20:15 | Schbirid | hi
20:17 | unwhan | :)
20:17 | * | unwhan is taking time to describe the case
20:18 | unwhan | how do i even begin
20:18 | unwhan | death alert ... death alert ... a website is dying
20:18 | unwhan | parts of it are still recoverable via Google Cache
20:19 | unwhan | but it is beyond my capacities to assemble it all from those bits
20:20 | ersi | I'd start by saying what site it is, i.e. at what URL it's currently accessible
20:20 | ersi | also, if it's mostly private or public material
20:20 | unwhan | it is a controversial website on the border of legal... but the admins have taken care to keep it legal
20:20 | unwhan | gimme sec
20:21 | unwhan | so that's the original URL: http://pgforum.freepowerboards.com/
20:21 | unwhan | meaning: Power Girl Forum
20:22 | unwhan | dedicated to LEGAL discussion about physically strong / athletic UNDERAGE girls
20:23 | unwhan | meagre parts of it are available from the WebArchive
20:23 | unwhan | much much more is available via Google Cache (so far!)
20:23 | unwhan | but we all know that Google Cache won't last
20:24 | unwhan | the good news is that all images have been hosted externally on ImageVenue
20:24 | unwhan | so the crucial part is restoring the HTML layer alone
20:26 | unwhan | admittedly it isn't a website for common audience but nonetheless contained rare, sought after and often censored (lawful) material
20:27 | ersi | ouch, forum + only available cached at google
20:28 | unwhan | i saved some 50 pages manually which is a far cry from archiving all
20:29 | SmileyG | site:http://pgforum.freepowerboards.com/ well it's indexed, does someone have a script for pulling all the urls from google and pulling the caches?
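
No ready-made script surfaces in the channel, but the idea can be sketched in two steps: save the Google result pages by hand (which sidesteps the bot detection unwhan mentions), pull the forum URLs out of them, then fetch Google's cached copy of each. The file names and the delay are assumptions:

    # results_*.html are the Google result pages for
    # site:pgforum.freepowerboards.com, saved manually from a browser.
    grep -oE 'http://pgforum\.freepowerboards\.com/[^"&<]*' results_*.html | sort -u > urls.txt
    # Fetch the cached copy of each URL, slowly, to avoid being blocked.
    while read -r url; do
        wget -E -w 10 --random-wait -U "Mozilla/5.0" \
            "https://webcache.googleusercontent.com/search?q=cache:$url"
    done < urls.txt
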
20:30 | SmileyG | Page 2 of about 5,990 results (0.16 seconds) #
20:30 | unwhan | (an added difficulty for me (personally) is that Google periodically bans me if I fetch too much)
20:32 | unwhan | even Google Cache is not complete, though I would say: fairly complete and perhaps the only good source (while it lasts)
20:34 | unwhan | in future, ideally, i will be better equipped for emergency archiving missions like this
20:34 | ersi | wget -m goes a long way, even if it gets trapped in a spider trap
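
A slightly fuller version of ersi's suggestion, with a depth cap and pauses so a spider trap or a fragile server does less damage; the flag choices are only a suggestion and the URL is a placeholder:

    # Mirror, but stop five levels deep and wait between requests.
    wget -m -l 5 -np -p -E -k -w 1 --random-wait http://forum.example.org/

-m alone recurses without limit; the -l 5 that follows it caps the depth.
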
20:35 | unwhan | :)
20:40 | unwhan | hmm. not all images were hosted externally. also some messages in older cache copies have been hidden from Google bot. but the total material is fairly well preserved.
20:44 | unwhan | for comparison, WebArchive has only 91 pages
20:46 | ersi | I think that site's out of luck, unfortunately
20:46 | SmileyG | how to limit wget to only webcache.googleusercontent.com ?
20:49 | ersi | Hmm, I think -A or -R (i.e. recursive accept/reject lists)
20:49 | Schbirid | that is some creepy shit
20:49 | unwhan | -D for --domains=
20:50 | SmileyG | wget -rH -Dwebcache.googleusercontent.com
20:50 | SmileyG | is what I was trying...
20:50 | unwhan | Schbirid: as I said, not for common audience. but lawful.
20:51 | SmileyG | -A and -R reference types of files...
20:52 | SmileyG | this stuff needs to be .... done
20:52 | SmileyG | basically I see us as needing a toolkit that, with little modification, can be pointed at any URL and fired off;
20:52 | unwhan | -D for "done" ;)
20:53 | Schbirid | SmileyG: man wget
20:53 | SmileyG | Schbirid: yeah errr I have.
20:54 | SmileyG | hence why i said -A and -R reference files :s
20:54 | SmileyG | that's what it says.
20:54 | SmileyG | problem is -Dwebcache.googleusercontent.com is instantly excluding the crawling (not mirroring) of the google search pages?
20:55 | unwhan | Google doesn't block me instantly but often enough to make a bulk download unreliable
20:55 | Schbirid | SmileyG: you are missing a space there
20:56 | SmileyG | --exclude-domains *.google.com *.google.co.uk doesn't work either...
20:56 | unwhan | Schbirid: space is unnecessary in wget options, I believe
20:56 | SmileyG | Schbirid: erm, looking at all the examples on the net there is no space...
20:56 | Schbirid | weird
20:56 | SmileyG | yup
20:56 | Schbirid | well, the man page explains the format iirc
20:57 | SmileyG | and both ways get the same problem - stops on the index.
20:57 | Schbirid | you did not even post your line, my crystal ball does not pick up any signals
20:57 | SmileyG | [21:54:37] < SmileyG> problem is -Dwebcache.googleusercontent.com
20:58 | SmileyG | no?
20:58 | SmileyG | or [21:50:23] < SmileyG> wget -rH -Dwebcache.googleusercontent.com
20:58 | SmileyG | ?
20:58 | Schbirid | you did not specify any url, only options
20:59 | SmileyG | wget -Dwebcache.googleusercontent.com -m https://www.google.co.uk/#q=site:http://pgforum.freepowerboards.com
20:59 | SmileyG | should crawl happily, and only download from the webcache from what I understand...
20:59 | Schbirid | wget knows no javascript
20:59 | SmileyG | except instead it doesn't actually crawl the google search page, it just instantly ignores it.
21:00 | unwhan | :/
21:00 | SmileyG | ah, it's using the javascript?
21:00 | SmileyG | Ok, so is there an easy way to fix that?
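
Two things work against the command SmileyG posted above: the #q= part of the Google URL is a fragment, which wget strips before the request is sent (the results on that page only exist via JavaScript, as Schbirid points out), and with -H the -D list also decides whose links wget may follow, so leaving www.google.co.uk out of it stops the crawl at the first page. A possible correction, using the static /search results page and still subject to Google's rate limits and blocking:

    # Crawl the plain HTML results page and span only to the listed hosts.
    wget -rH -l 2 -E -w 10 --random-wait -U "Mozilla/5.0" \
        -Dwww.google.co.uk,webcache.googleusercontent.com \
        "https://www.google.co.uk/search?q=site:pgforum.freepowerboards.com&num=100"
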
21:01 | unwhan | for one, i can easily save the Google index. it is only 7 pages of about 100 results each. even though it says "5,920 results" :(.
21:02 | alard | http://warrick.cs.odu.edu/ ?
21:03 | unwhan | maybe it doesn't display them all... indeed, it seems as if Google had more cached copies than what it shows in the index.
21:04 | unwhan | alard: nice link
21:04 | alard | I have no idea if it actually works, but that tool has been discussed here before.
21:05 | unwhan | "Warrick will not be accepting new jobs at the moment! We're sorry for the inconvenience. Warrick has been overloaded with jobs."
21:05 | unwhan | ...but "consider downloading the command line version of Warrick for Unix and Linux".
21:08 | alard | Yeah, you can run it yourself. (But do check the warnings about Google's blocking.)
21:13 | alard | cinch.fm looks quite horrible on the inside. ASP.NET viewstate, yuck.
21:15 | unwhan | it's good to know about Warrick but it's not good for me right now since I don't have access to any Linux machine. if I had, I could probably complete the mission on my own. of course eventually I will have a Linux machine.
21:21 | unwhan | I remember that, many years back, Google offered accounts to people who needed to crawl its search results programmatically. If that is still the case, an account like that could be used for large-scale Google Cache scraping without conflict.
21:27 | unwhan | anyway, i can't do anything more today. i need sleep now. i'm glad that i have made it to this channel. i'll be back tomorrow but I'll stay logged in.
21:40 | unwhan | Before I go... one line of note about the Power Girl Forum: "thx a lot this forum is the best forum in the world :-)" – logge1968 (11 July 2011)
23:22 | underscor | whenever emijrp comes back, tell him 10939375 items