#archiveteam 2014-02-10,Mon


Time Nickname Message
00:53 🔗 Leo_TCK but tell me what have you found out
00:53 🔗 Leo_TCK oops
00:53 🔗 Leo_TCK sry wrong channel
00:53 🔗 Leo_TCK didnt mean to say that here
05:54 🔗 namespace So earlier you guys said that the only way to grab the data from Googles awful site. (Not gonna lie, I think google groups is the least usable site I've ever seen from a professional company that should know better, let alone the worst usenet client.)
05:55 🔗 namespace was to use phantomJS, would this be done through grabs of the state of the webkit client while it's on a particular comment tree or?
05:57 🔗 namespace Is there somewhere I can go to read about how previous JS heavy sites have been grabbed?
06:07 🔗 ivan` namespace: using a browser engine is probably a bad idea given that there are hundreds of millions of pages
06:09 🔗 ivan` Google Reader was JavaScript-heavy and that was grabbed via a JSON API
06:09 🔗 ivan` you'd have to figure out what the browser is requesting and write some Python to make similar requests
06:11 🔗 namespace ivan`: Actually interestingly enough, I'm reading a stackoverflow thread that suggests there may be an api for google groups.
06:11 🔗 ivan` Leo_TCK: it wasn't in any of the .zips in my ut-files grab, sorry
06:11 🔗 namespace https://stackoverflow.com/questions/3757793/api-for-google-groups
06:11 🔗 namespace Nope, nevermind it's just for getting lists of users. -_-
06:14 🔗 SketchCow Hey, maniacs.
06:14 🔗 namespace SketchCow: Hey. Stupid question, are you Jason Scott?
06:15 🔗 ivan` /whois SketchCow ;)
06:15 🔗 namespace ivan`: Oh right, I forgot IRC has that feature.
06:15 🔗 * SketchCow sings the phantom of the opera soundtrack
06:16 🔗 SketchCow iiiiiii aammmm the pahnnntoooommm of the aaarrrrrchhhiives
06:16 🔗 namespace Seriously though dude, you're awesome. You're this modern monuments man saving history from the clutches of megacorps and "dewey eyed fucksticks".
06:17 🔗 namespace Now back to groups...
06:18 🔗 namespace So part of the problem I'm having is that if you take a look at the stuff that the GG client spits back at you when you try and inspect it, it's all obfuscated or minimized garbage.
06:20 🔗 namespace (Or rather the source code is at least, the actual network requests just don't lend themselves to comprehension.)
06:20 🔗 yipdw that's actually why I suggested a browser engine :P
06:21 🔗 yipdw it eliminates the need to know that layer (which is subject to change anyway)
06:21 🔗 yipdw now you do have the scaling problem
06:21 🔗 yipdw I'd be interested to see how well the Warrior or something like it could cope
06:24 🔗 ivan` if you go with the browser engine, you're counting on 1) Google's JavaScript actually working as you want it (huge threads, no intentional blocking features added by goog); 2) warriors having hundreds of MB free for webkit/blink
06:24 🔗 ivan` someone like Kenshin can do hundreds of requests per second but probably not with a browser
06:24 🔗 namespace So I have zero experience with the web stack. (Http/HTML/JS/etc). Should I start with the http RFC or?
06:26 🔗 ivan` http://www.garshol.priv.no/download/text/http-tut.html
06:27 🔗 ivan` http://www.jmarshall.com/easy/http/
06:27 🔗 ivan` I don't think there's really that much to know; just get your URL and request headers right
06:27 🔗 ivan` wget or wpull or requests will handle the HTTP requesting for you
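ivan`'s advice above (copy the browser's URL and headers, then let a library like requests do the HTTP) can be sketched offline with `requests`' prepare step. Everything here is illustrative: the User-Agent string and the `hl` parameter are stand-ins, not values the log confirms Google Groups needs.

```python
# Sketch of replaying a browser request with the `requests` library,
# without actually hitting the network: prepare the request and inspect
# what would go over the wire. Header values are placeholders copied
# from a browser's network inspector in practice.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0",
    "Accept": "text/html,application/xhtml+xml",
})

req = requests.Request("GET", "https://groups.google.com/forum/", params={"hl": "en"})
prepared = session.prepare_request(req)

print(prepared.url)                    # full URL with query string
print(prepared.headers["User-Agent"])  # headers requests would send
```

Once the prepared request looks like what the browser sent, `session.send(prepared)` (or plain `session.get(...)`) performs the actual fetch.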
06:28 🔗 ivan` reverse-engineering google's blobs of JS is hopefully unnecessary
06:31 🔗 ivan` if you really want to learn JS stuff, https://sivers.org/learn-js suggests Professional JavaScript for Web Developers, 3rd Edition
06:33 🔗 namespace "RFC 2068 describes HTTP/1.1, which extends and improves HTTP/1.0 in a number of areas. Very few browsers support it (MSIE 4.0 is the only one known to the author), but servers are beginning to do so." This is pretty old guide. :P
06:33 🔗 aggrosk I recommend "Javascript: The Good Parts" if you have previous programming experience.
06:35 🔗 namespace ivan`: So the general problem with that approach is that I can't even tell what I'm requesting.
06:35 🔗 namespace It's not like GG has a sane directory structure.
06:36 🔗 namespace Most of the requests my browser makes supposedly have no content, etc.
06:36 🔗 namespace Or how when I scroll I always request the same thing from the server.
06:36 🔗 namespace Even though I get different content.
06:38 🔗 ivan` yeah, I just noticed that
06:38 🔗 ivan` I wonder if some weird SPDY or HTTP/TCP session stuff is going on
06:38 🔗 namespace Maybe. Probably need to pull out wireshark to investigate this one.
06:39 🔗 * namespace grumbles when google refuses to serve me anything but https
06:43 🔗 namespace Okay so apparently I can coax wireshark into decrypting SSL traffic if I have a copy of the decryption key. How do I extract the key my browser sends google for them to securely send me data?
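One answer to namespace's question, assuming an NSS-based browser like Firefox: rather than extracting keys from the handshake, have the browser log its per-session secrets to a file Wireshark can read. The file path below is an arbitrary choice.

```shell
# NSS-based browsers (Firefox, Chrome) write per-session TLS secrets to
# the file named in SSLKEYLOGFILE. Point Wireshark at the same file via
# Preferences -> Protocols -> SSL -> "(Pre)-Master-Secret log filename".
export SSLKEYLOGFILE="$HOME/sslkeys.log"
firefox &   # must be launched from this shell so it inherits the variable
```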
06:44 🔗 namespace (That is how SSL works right? *wikis*)
06:44 🔗 ivan` try scrolling around Google Groups in Firefox and you'll get much better results
06:45 🔗 ivan` it uses POST and gets a normal-looking response body
06:45 🔗 ivan` also, right click the requests scrolling by in the Console tab and check "Log Request and Response Bodies"
06:46 🔗 namespace I am using firefox.
06:47 🔗 ivan` oh, I only saw the No Content stuff in Chrome
06:47 🔗 namespace I'm seeing it in Firefox too.
06:47 🔗 ivan` I'm using a SOCKS5 proxy in Firefox, but I do have spdy and websocket enabled
06:48 🔗 ivan` let me try in a non-SOCKS5 FF
06:48 🔗 namespace Well spdy is working here, so it's not that.
06:49 🔗 namespace Websockets also work. (Both of these determined through online "test my browser" style stuff.)
06:50 🔗 ivan` in a clean Firefox 27 profile on Windows I'm seeing normal POST requests to https://groups.google.com/forum/msg and https://groups.google.com/forum/msg_bkg
06:51 🔗 ivan` requests to /csi are No Content though
06:51 🔗 namespace ivan`: I think I might just be reading the wrong things. For one thing until you mentioned it I assumed post requests were my browser sending stuff, and that nothing actually gets to me through that.
06:53 🔗 ivan` your browser gets a response for any kind of request (even if it's No Content)
06:54 🔗 ivan` also I noticed it's using x-gwt-rpc and this is the document that describes the protocol https://docs.google.com/document/d/1eG0YocsYYbNAtivkLtcaiEE5IOF5u4LUol8-LL0TIKU/preview?pli=1&sle=true#heading=h.lczgog5ezfjp
06:54 🔗 ivan` you may not need all of that information yet
06:55 🔗 ivan` the GWT source code is also available if you need to figure out how the stranger No Content-response transport works, but hopefully you can just do the POSTs that it does in Firefox
06:56 🔗 ivan` you should definitely click on those POST requests in Firefox's inspector and look at the response body at the end of the popup window
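Replaying one of those `/forum/msg` POSTs could start from a sketch like this. The endpoint is the one ivan` saw in Firefox; the form field name and body are placeholders, since the real GWT-RPC payload has to be copied out of the request body shown in the browser's network tab.

```python
# Hypothetical sketch of rebuilding a POST seen in the inspector.
# "payload" is a placeholder field name, not the real GWT-RPC body.
import requests

req = requests.Request(
    "POST",
    "https://groups.google.com/forum/msg",        # endpoint seen in Firefox
    data={"payload": "COPIED-FROM-INSPECTOR"},    # placeholder body
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
prepared = req.prepare()

# Inspect the encoded body before sending it anywhere.
print(prepared.body)
```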
06:57 🔗 namespace Okay, I'll see how far I can go with the help given so far, read the rest of that HTTP tutorial, etc.
12:15 🔗 Nemo_bis GWT is GlamWikiToolset for me :)
14:37 🔗 joepie91 urgent dump of bitcoin.it wiki needed, but no time to run wikidump myself right now
14:38 🔗 joepie91 it's owned by the owner of mt. gox
14:38 🔗 joepie91 and it looks like shit might be going down very soomn
14:38 🔗 joepie91 soon *
14:38 🔗 Dud1 Is it a big wiki?
14:41 🔗 Nemo_bis Dud1: no, minuscule: wanna do? https://it.bitcoin.it/wiki/Speciale:Statistiche
14:42 🔗 Dud1 Have it started already ;)
14:42 🔗 Nemo_bis Ah beware there are multiple, still smallish https://en.bitcoin.it/wiki/Special:Statistics
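For wikis like these, the usual WikiTeam approach is dumpgenerator.py against the wiki's API endpoint. The invocation below is a sketch with commonly used flags; check `--help` on your copy of the script.

```shell
# Dump a MediaWiki (full page histories plus images) with WikiTeam's
# dumpgenerator.py, pointed at the wiki's api.php.
python dumpgenerator.py --api=https://en.bitcoin.it/w/api.php --xml --images
```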
14:46 🔗 midas mediawiki should have a warc handle. "archive this file and done!"
14:52 🔗 Nemo_bis midas: I'm not sure what benefits it would have compared to HTML export, but you could file that bug :) they're working on HTML export now
14:53 🔗 Nemo_bis midas: I think this would be best bet https://bugzilla.wikimedia.org/enter_bug.cgi?product=openZIM&component=zimwriter
14:53 🔗 midas oh in that case i didnt say anything, if the html export is a public function :p
14:53 🔗 Nemo_bis There are two separate public HTML export features, but they're very ugly
14:54 🔗 Nemo_bis The kiwix.org guy is working on making them actually work but I'm not sure those endpoints will be open and they won't be in core MediaWiki anyway.
14:59 🔗 Nemo_bis Hence why I think a feature request in that component will be useful. :)
15:06 🔗 midas yeah, would be cool
15:06 🔗 midas specialpage:archiveteam
15:06 🔗 midas :p
15:06 🔗 DFJustin http://www.egreetings.com/goodbye
15:07 🔗 midas ... what is it in february.
15:08 🔗 namespace 10th
15:08 🔗 namespace Basically if we want this one we've got two days.
15:09 🔗 midas i think this could be grabbed using archivebot, but yipdw could confirm or debunk that
15:09 🔗 DFJustin most of it probably
15:18 🔗 namespace Good luck guys.
16:04 🔗 Nemo_bis How do I get Wayback to archive such a thing in a meaningful way? https://toolserver.org/~nemobis/crstats/robla/
17:23 🔗 yipdw midas: you might as well
17:23 🔗 yipdw if it runs away we just abort
17:32 🔗 midas i think DFJustin is grabbing it already using the bot :)
17:42 🔗 Dud2 On bitcoin.it it seems to be stuck on one particular page (for 15+ minutes)
17:44 🔗 Nemo_bis is the ram usage increasing?
17:44 🔗 Nemo_bis or do you see network activity
18:59 🔗 Nemo_bis deathy: do you still have that huge server at your disposal? :)
20:03 🔗 beardicus is there a dogster/catster channel?
20:15 🔗 yipdw beardicus: #rawdogster
20:15 🔗 beardicus thanks yipdw.
20:17 🔗 Leo_TCK what is a dogster
20:17 🔗 SketchCow A pet social media site.
20:17 🔗 SketchCow Lasted 10 years, is being shut down and turned into some piece of crap by the new owners.
20:18 🔗 SketchCow Millions of pet profiles and messages to be deleted.
20:18 🔗 Leo_TCK oh
20:18 🔗 arkiver and there is catster, that one is for the cats
20:18 🔗 SketchCow http://www.dogster.com/
20:18 🔗 arkiver http://www.catster.com/
20:39 🔗 Dud1 Nemo_bis: Sorry yeah there is network activity, it just seems to be a fairly big title
20:41 🔗 Nemo_bis Dud1: when a page is big and the site slow, it can take a while to download the whole history; the script is a bit stupid about that
20:42 🔗 Atluxity I just sent a mail about archiveteam to the norwegian computer-history mailinglist. Might be someone who cares to join with a warrior there
20:43 🔗 Dud1 It just finished it, there were 1851 edits to it. It doubled the size of the file.
20:45 🔗 arkiver SketchCow: How much of youtube or how many percent of youtube has the IA saved now?
20:57 🔗 DFJustin so I come across wikis from time to time which archivebot is not ideal for, what's the best way to nominate them for wikiteam
20:58 🔗 Nemo_bis DFJustin: downloading them yourself :P
20:58 🔗 Nemo_bis DFJustin: or at least add them to https://wikiapiary.com/
20:58 🔗 Nemo_bis There's a nice form for that, you only need to (register and) enter name and API URL
21:14 🔗 SketchCow arkiver: Well, that may be the stupidest question asked on here since we banned that french kid
21:14 🔗 SketchCow Thinking......
21:14 🔗 SketchCow Yep.
21:17 🔗 Jonimus Does the IA even have enough bandwidth to download youtube as fast as videos are being uploaded? Trying to back up youtube would be nigh impossible.
21:17 🔗 arkiver SketchCow: haha, so what stupid question had the french kid asked?
21:17 🔗 RedType The answer is: all of youtube. A mirror of youtube can be found at www.youtube.co.uk/
21:18 🔗 Nemo_bis Last week's income is quite good, must be FTP sites? :) http://teamarchive0.fnf.archive.org:8088/mrtg/networkv2.html
21:20 🔗 SketchCow The french kid asked something along the lines of how could we not use his Windows/DOS filename-structured mess of Geocities saves.
21:20 🔗 SketchCow Youtube adds 48 hours of video every minute.
21:20 🔗 SketchCow Next stupid question.
21:20 🔗 arkiver hmm
21:20 🔗 SketchCow That number may be old.
21:20 🔗 arkiver thinking already
21:20 🔗 arkiver ;)
21:21 🔗 RedType SketchCow: they claim its >100 hours per minute now
21:22 🔗 Dud1 And probably about 99.9 hours of that is total crap
21:26 🔗 SketchCow https://twitter.com/johnv/status/432988213800497152
21:26 🔗 SketchCow This guy is not sending me a birthday card
21:28 🔗 RedType SketchCow:
21:28 🔗 RedType Archive Team: better me than your family
21:29 🔗 arkiver John Vars:
21:29 🔗 arkiver Milestone: My first twitter fight!
21:29 🔗 arkiver :P
21:33 🔗 yipdw "better me than your family" hahaha
21:44 🔗 joepie91 SketchCow:
21:44 🔗 joepie91 .tw https://twitter.com/johnv/status/432989058042560512
21:45 🔗 joepie91 oh
21:45 🔗 joepie91 no botpie here
21:50 🔗 SketchCow Dogster editor in chief started following me.
21:51 🔗 SketchCow 651 followers - now there's someone who "got" social media
21:52 🔗 joepie91 haha
22:11 🔗 deathy Nemo_bis: just noticed.. the 40-ish GB of RAM one, yeah
22:12 🔗 Nemo_bis deathy: nice :) how many cores and how much disk?
22:13 🔗 ersi Dogster, bwahaha
22:14 🔗 deathy Nemo_bis: 1.7 TB free disk, 4 'real' cores ( i7-920 )
22:14 🔗 Nemo_bis SketchCow: he feels so insulted that he wants to be sure he doesn't miss any insult you make him without mentioning him? :)
22:14 🔗 Nemo_bis deathy: wonderful! Do you feel like repackaging the geocities torrent? :D
22:15 🔗 SmileyG Nemo_bis: lol how big??
22:16 🔗 Nemo_bis There are some scripts to download and clean up, then a mere 7z a -t7z -m0=BZip2 -mmt=8 -mx9 should do the job (or 4 if you don't use hyperthreading)
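The compression line above, spelled out with placeholder archive and input names (both are stand-ins, not paths from the log):

```shell
# bzip2-compressed 7z archive, 8 threads, maximum compression;
# match -mmt to your physical core count (4 without hyperthreading).
7z a -t7z -m0=BZip2 -mmt=8 -mx9 geocities-repack.7z geocities-cleaned/
```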
22:16 🔗 Nemo_bis SmileyG: it's not even 1 TB
22:16 🔗 Nemo_bis I'd really like to have a geocities archive I can happily bzgrep ! I NEED IT
22:17 🔗 SketchCow Dragan is working on that!
22:17 🔗 SketchCow he's been cleaning up the geocities mess for years now.
22:17 🔗 deathy Nemo_bis: I can, but kind of stressed with some things until Wednesday afternoon at least. If still needed remind me after that
22:19 🔗 Nemo_bis SketchCow: I know, but is he still working on publishing another "patched version"? I doubt he'd mind being helped :)
22:19 🔗 Nemo_bis deathy: it's not urgent at all :) the scripts are at https://github.com/despens/Geocities , found thanks to SketchCow's pointers.
22:20 🔗 SketchCow He's moving to the US, he'll have more time.
22:21 🔗 Nemo_bis Oh. Maybe he still likes some testers?
22:21 🔗 Nemo_bis The first 8 steps don't require a database. https://github.com/despens/Geocities/tree/master/scripts/geocities.archiveteam.torrent
22:24 🔗 deathy Nemo_bis: sent myself a reminder mail, will check back with you on Wed/Thu :)
22:26 🔗 Nemo_bis deathy: great, if you get to it let's check the details better, especially the exact compression command
22:45 🔗 RedType 14:43 <@tantek> another silo death coming up: http://zootool.com/goodbye
23:02 🔗 joepie91 RedType: refreshingly honest shutdown message
23:03 🔗 RedType "Zootool made us realize that the general idea of running a central service is nothing we believe in any longer. Your data should belong to you and shouldn't be stored on our servers. You shouldn't have to rely on us or on any other service to keep your data secure and online."
23:03 🔗 RedType " On March, 15th all data and images will be deleted forever."
23:03 🔗 midas impressive.
23:04 🔗 BlueMax "hi, we shouldn't be holding your car in this storage locker, so in a couple of weeks we're going to blow your car the fuck up"
23:05 🔗 midas what the fuck is wrong with people these days. cant they send us a bloody tweet in advance?
23:06 🔗 Dud1 It's not going down until the 15th of March.
23:06 🔗 BlueMax still not very long until it all gets deleted
23:08 🔗 ivan` zootool has 404ed their tags pages already
23:09 🔗 ivan` maybe some convincing over twitter could solve that http://zootool.com/about/
23:11 🔗 midas tweet sent.
23:11 🔗 midas to bastian
23:11 🔗 midas he is the only one thats really active on twitter
23:23 🔗 garyrh I got the my opera username crawl code on github
23:23 🔗 garyrh https://github.com/MithrandirAgain/myopera-username-grab
23:23 🔗 garyrh thoughts?
23:28 🔗 yipdw garyrh: how is that actually writing anything to disk?
23:29 🔗 yipdw I see
23:29 🔗 yipdw oh, wait
23:29 🔗 yipdw so it's not generating WARCs
23:29 🔗 RedType looks like the zootool api is still live
23:29 🔗 garyrh no, it's just crawling for usernames
23:47 🔗 midas right, load was a tad high
23:49 🔗 midas SketchCow: making friends with SAY media? :p
23:55 🔗 Nemo_bis http://archiveteam.org/index.php?title=My_Opera looks bigger than I thought, not much time
