[00:53] but tell me what have you found out
[00:53] oops
[00:53] sry wrong channel
[00:53] didnt mean to say that here
[05:54] So earlier you guys said that the only way to grab the data from Google's awful site. (Not gonna lie, I think google groups is the least usable site I've ever seen from a professional company that should know better, let alone the worst usenet client.)
[05:55] was to use phantomJS, would this be done through grabs of the state of the webkit client while it's on a particular comment tree or?
[05:57] Is there somewhere I can go to read about how previous JS heavy sites have been grabbed?
[06:07] namespace: using a browser engine is probably a bad idea given that there are hundreds of millions of pages
[06:09] Google Reader was JavaScript-heavy and that was grabbed via a JSON API
[06:09] you'd have to figure out what the browser is requesting and write some Python to make similar requests
[06:11] ivan`: Actually interestingly enough, I'm reading a stackoverflow thread that suggests there may be an API for google groups.
[06:11] Leo_TCK: it wasn't in any of the .zips in my ut-files grab, sorry
[06:11] https://stackoverflow.com/questions/3757793/api-for-google-groups
[06:11] Nope, nevermind it's just for getting lists of users. -_-
[06:14] Hey, maniacs.
[06:14] SketchCow: Hey. Stupid question, are you Jason Scott?
[06:15] /whois SketchCow ;)
[06:15] ivan`: Oh right, I forgot IRC has that feature.
[06:15] * SketchCow sings the phantom of the opera soundtrack
[06:16] iiiiiii aammmm the pahnnntoooommm of the aaarrrrrchhhiives
[06:16] Seriously though dude, you're awesome. You're this modern monuments man saving history from the clutches of megacorps and "dewey eyed fucksticks".
[06:17] Now back to groups...
[06:18] So part of the problem I'm having is that if you take a look at the stuff that the GG client spits back at you when you try and inspect it, it's all obfuscated or minified garbage.
[06:20] (Or rather the source code is at least, the actual network requests just don't lend themselves to comprehension.)
[06:20] that's actually why I suggested a browser engine :P
[06:21] it eliminates the need to know that layer (which is subject to change anyway)
[06:21] now you do have the scaling problem
[06:21] I'd be interested to see how well the Warrior or something like it could cope
[06:24] if you go with the browser engine, you're counting on 1) Google's JavaScript actually working as you want it (huge threads, no intentional blocking features added by goog); 2) warriors having hundreds of MB free for webkit/blink
[06:24] someone like Kenshin can do hundreds of requests per second but probably not with a browser
[06:24] So I have zero experience with the web stack. (HTTP/HTML/JS/etc). Should I start with the HTTP RFC or?
[06:26] http://www.garshol.priv.no/download/text/http-tut.html
[06:27] http://www.jmarshall.com/easy/http/
[06:27] I don't think there's really that much to know; just get your URL and request headers right
[06:27] wget or wpull or requests will handle the HTTP requesting for you
[06:28] reverse-engineering google's blobs of JS is hopefully unnecessary
[06:31] if you really want to learn JS stuff, https://sivers.org/learn-js suggests Professional JavaScript for Web Developers, 3rd Edition
[06:33] "RFC 2068 describes HTTP/1.1, which extends and improves HTTP/1.0 in a number of areas. Very few browsers support it (MSIE 4.0 is the only one known to the author), but servers are beginning to do so." This is a pretty old guide. :P
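A minimal sketch of the "copy the URL and request headers from the browser and replay them with Python" approach discussed above. The URL, headers and body below are placeholders, not the real Google Groups protocol; the real values would be copied from whatever the network inspector shows.

    import requests

    # Placeholders -- copy the real URL, headers and request body from the
    # POST you see in the browser's network inspector.
    url = "https://groups.google.com/forum/..."      # hypothetical endpoint
    headers = {
        "User-Agent": "Mozilla/5.0",                 # present yourself as a browser
        "Content-Type": "text/plain; charset=utf-8", # whatever the page actually sent
    }
    body = "..."                                     # the opaque request body, verbatim

    resp = requests.post(url, headers=headers, data=body)
    print(resp.status_code)
    print(resp.text[:500])                           # eyeball the response body
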
[06:33] I recommend "Javascript: The Good Parts" if you have previous programming experience.
[06:35] ivan`: So the general problem with that approach is that I can't even tell what I'm requesting.
[06:35] It's not like GG has a sane directory structure.
[06:36] Most of the requests my browser makes supposedly have no content, etc.
[06:36] Or how when I scroll I always request the same thing from the server.
[06:36] Even though I get different content.
[06:38] yeah, I just noticed that
[06:38] I wonder if some weird SPDY or HTTP/TCP session stuff is going on
[06:38] Maybe. Probably need to pull out wireshark to investigate this one.
[06:39] * namespace grumbles when google refuses to serve me anything but https
[06:43] Okay so apparently I can coax wireshark into decrypting SSL traffic if I have a copy of the decryption key. How do I extract the key my browser sends google for them to securely send me data?
[06:44] (That is how SSL works right? *wikis*)
[06:44] try scrolling around Google Groups in Firefox and you'll get much better results
[06:45] it uses POST and gets a normal-looking response body
[06:45] also, right click the requests scrolling by in the Console tab and check "Log Request and Response Bodies"
[06:46] I am using firefox.
[06:47] oh, I only saw the No Content stuff in Chrome
[06:47] I'm seeing it in Firefox too.
[06:47] I'm using a SOCKS5 proxy in Firefox, but I do have spdy and websocket enabled
[06:48] let me try in a non-SOCKS5 FF
[06:48] Well spdy is working here, so it's not that.
[06:49] Websockets also work. (Both of these determined through online "test my browser" style stuff.)
[06:50] in a clean Firefox 27 profile on Windows I'm seeing normal POST requests to https://groups.google.com/forum/msg and https://groups.google.com/forum/msg_bkg
[06:51] requests to /csi are No Content though
[06:51] ivan`: I think I might just be reading the wrong things. For one thing until you mentioned it I assumed post requests were my browser sending stuff, and that nothing actually gets to me through that.
[06:53] your browser gets a response for any kind of request (even if it's No Content)
[06:54] also I noticed it's using x-gwt-rpc and this is the document that describes the protocol https://docs.google.com/document/d/1eG0YocsYYbNAtivkLtcaiEE5IOF5u4LUol8-LL0TIKU/preview?pli=1&sle=true#heading=h.lczgog5ezfjp
[06:54] you may not need all of that information yet
[06:55] the GWT source code is also available if you need to figure out how the stranger No Content-response transport works, but hopefully you can just do the POSTs that it does in Firefox
[06:56] you should definitely click on those POST requests in Firefox's inspector and look at the response body at the end of the popup window
[06:57] Okay, I'll see how far I can go with the help given so far, read the rest of that HTTP tutorial, etc.
[12:15] GWT is GlamWikiToolset for me :)
[14:37] urgent dump of bitcoin.it wiki needed, but no time to run wikidump myself right now
[14:38] it's owned by the owner of mt. gox
[14:38] and it looks like shit might be going down very soomn
[14:38] soon *
[14:38] Is it a big wiki?
[14:41] Dud1: no, minuscule: wanna do? https://it.bitcoin.it/wiki/Speciale:Statistiche
[14:42] Have it started already ;)
[14:42] Ah beware there are multiple, still smallish https://en.bitcoin.it/wiki/Special:Statistics
[14:46] mediawiki should have a warc handle. "archive this file and done!"
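On the Wireshark question above: one common way to get the key material it needs (a general fact about NSS-based browsers, not something answered in this log) is the SSLKEYLOGFILE environment variable. Firefox, assuming a build whose NSS honours it, appends per-session TLS keys to the named file, and Wireshark can be pointed at that file under Preferences -> Protocols -> SSL -> "(Pre)-Master-Secret log filename". A rough sketch:

    import os
    import subprocess

    # Assumes a Firefox build whose NSS honours SSLKEYLOGFILE.
    keylog = os.path.expanduser("~/sslkeys.log")
    env = dict(os.environ, SSLKEYLOGFILE=keylog)

    # Start the browser with key logging enabled, then browse Google Groups
    # while Wireshark captures; decryption uses the file named above.
    subprocess.Popen(["firefox"], env=env)
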
[14:52] midas: I'm not sure what benefits it would have compared to HTML export, but you could file that bug :) they're working on HTML export now
[14:53] midas: I think this would be best bet https://bugzilla.wikimedia.org/enter_bug.cgi?product=openZIM&component=zimwriter
[14:53] oh in that case i didnt say anything, if the html export is a public function :p
[14:53] There are two separate public HTML export features, but they're very ugly
[14:54] The kiwix.org guy is working on making them actually work but I'm not sure those endpoints will be open and they won't be in core MediaWiki anyway.
[14:59] Hence why I think a feature request in that component will be useful. :)
[15:06] yeah, would be cool
[15:06] specialpage:archiveteam
[15:06] :p
[15:06] http://www.egreetings.com/goodbye
[15:07] ... what is it in February.
[15:08] 10th
[15:08] Basically if we want this one we've got two days.
[15:09] i think this could be grabbed using archivebot, but yipdw could confirm or debunk that
[15:09] most of it probably
[15:18] Good luck guys.
[16:04] How do I get Wayback to archive such a thing in a meaningful way? https://toolserver.org/~nemobis/crstats/robla/
[17:23] midas: you might as well
[17:23] if it runs away we just abort
[17:32] i think DFJustin is grabbing it already using the bot :)
[17:42] On bitcoin.it it seems to be stuck on one particular page (for 15+ minutes)
[17:44] is the ram usage increasing?
[17:44] or do you see network activity
[18:59] deathy: do you still have that huge server at your disposal? :)
[20:03] is there a dogster/catster channel?
[20:15] beardicus: #rawdogster
[20:15] thanks yipdw.
[20:17] what is a dogster
[20:17] A pet social media site.
[20:17] Lasted 10 years, is being shut down and turned into some piece of crap by the new owners.
[20:18] Millions of pet profiles and messages to be deleted.
[20:18] oh
[20:18] and there is catster, that one is for the cats
[20:18] http://www.dogster.com/
[20:18] http://www.catster.com/
[20:39] Nemo_bis: Sorry yeah there is network activity, it just seems to be a fairly big title
[20:41] Dud1: when a page is big and the site slow, it can take a while to download the whole history; the script is a bit stupid about that
[20:42] I just sent a mail about archiveteam to the norwegian computer-history mailinglist. Might be someone who cares to join with a warrior there
[20:43] It just finished it, there were 1851 edits to it. It doubled the size of the file.
[20:45] SketchCow: How much of youtube, or how many percent of youtube, has the IA saved now?
[20:57] so I come across wikis from time to time which archivebot is not ideal for, what's the best way to nominate them for wikiteam
[20:58] DFJustin: downloading them yourself :P
[20:58] DFJustin: or at least add them to https://wikiapiary.com/
[20:58] There's a nice form for that, you only need to (register and) enter name and API URL
[21:14] arkiver: Well, that may be the stupidest question asked on here since we banned that french kid
[21:14] Thinking......
[21:14] Yep.
[21:17] Does the IA even have enough bandwidth to download youtube as fast as they are being uploaded? Trying to back up youtube would be nigh impossible.
[21:17] SketchCow: haha, so what stupid question had the french kid asked?
[21:17] The answer is: all of youtube. A mirror of youtube can be found at www.youtube.co.uk/
[21:18] Last week's income is quite good, must be FTP sites? :) http://teamarchive0.fnf.archive.org:8088/mrtg/networkv2.html
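A rough sketch of the kind of wiki grab discussed above (bitcoin.it, wikis nominated for wikiteam): enumerating pages over the standard MediaWiki API and saving each one's export XML. The /w/api.php path is an assumption about where the API lives; the WikiTeam dumpgenerator.py script is the real tool and additionally handles full histories, images and throttling.

    import requests

    API = "https://en.bitcoin.it/w/api.php"    # assumed API URL; check Special:Version

    def all_titles(api):
        # Enumerate every page title via list=allpages, following continuation.
        params = {"action": "query", "list": "allpages", "aplimit": "max",
                  "format": "json", "continue": ""}
        while True:
            data = requests.get(api, params=params).json()
            for page in data["query"]["allpages"]:
                yield page["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    for title in all_titles(API):
        # action=query&export returns the same XML as Special:Export
        # (current revision only -- full history needs more work).
        xml = requests.get(API, params={"action": "query", "titles": title,
                                        "export": 1, "exportnowrap": 1}).text
        fname = title.replace("/", "_").replace(" ", "_") + ".xml"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(xml)
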
[21:20] The french kid asked something along the lines of how could we not use his Windows/DOS filename-structured mess of Geocities saves.
[21:20] Youtube adds 48 hours of video every minute.
[21:20] Next stupid question.
[21:20] hmm
[21:20] That number may be old.
[21:20] thinking already
[21:20] ;)
[21:21] SketchCow: they claim it's >100 hours per minute now
[21:22] And probably about 99.9 hours of that is total crap
[21:26] https://twitter.com/johnv/status/432988213800497152
[21:26] This guy is not sending me a birthday card
[21:28] SketchCow:
[21:28] Archive Team: better me than your family
[21:29] John Vars:
[21:29] Milestone: My first twitter fight!
[21:29] :P
[21:33] "better me than your family" hahaha
[21:44] SketchCow:
[21:44] .tw https://twitter.com/johnv/status/432989058042560512
[21:45] oh
[21:45] no botpie here
[21:50] Dogster editor in chief started following me.
[21:51] 651 followers - now there's someone who "got" social media
[21:52] haha
[22:11] Nemo_bis: just noticed.. the 40-ish GB of RAM one, yeah
[22:12] deathy: nice :) how many cores and how much disk?
[22:13] Dogster, bwahaha
[22:14] Nemo_bis: 1.7 TB free disk, 4 'real' cores ( i7-920 )
[22:14] SketchCow: he feels so insulted that he wants to be sure he doesn't miss any insult you make about him without mentioning him? :)
[22:14] deathy: wonderful! Do you feel like repackaging the geocities torrent? :D
[22:15] Nemo_bis: lol how big??
[22:16] There are some scripts to download and clean up, then a mere 7z a -t7z -m0=BZip2 -mmt=8 -mx9 should do the job (or 4 if you don't use hyperthreading)
[22:16] SmileyG: it's not even 1 TB
[22:16] I'd really like to have a geocities archive I can happily bzgrep ! I NEED IT
[22:17] Dragan is working on that!
[22:17] he's been cleaning up the geocities mess for years now.
[22:17] Nemo_bis: I can, but kind of stressed with some things until Wednesday afternoon at least. If still needed remind me after that
[22:19] SketchCow: I know, but is he still working on publishing another "patched version"? I doubt he'd mind being helped :)
[22:19] deathy: it's not urgent at all :) the scripts are at https://github.com/despens/Geocities , found thanks to SketchCow's pointers.
[22:20] He's moving to the US, he'll have more time.
[22:21] Oh. Maybe he still likes some testers?
[22:21] The first 8 steps don't require a database. https://github.com/despens/Geocities/tree/master/scripts/geocities.archiveteam.torrent
[22:24] Nemo_bis: sent myself a reminder mail, will check back with you on Wed/Thu :)
[22:26] deathy: great, if you get to it let's check the details better, especially the exact compression command
[22:45] 14:43 <@tantek> another silo death coming up: http://zootool.com/goodbye
[23:02] RedType: refreshingly honest shutdown message
[23:03] "Zootool made us realize that the general idea of running a central service is nothing we believe in any longer. Your data should belong to you and shouldn't be stored on our servers. You shouldn't have to rely on us or on any other service to keep your data secure and online."
[23:03] " On March, 15th all data and images will be deleted forever."
[23:03] impressive.
[23:04] "hi, we shouldn't be holding your car in this storage locker, so in a couple of weeks we're going to blow your car the fuck up"
[23:05] what the fuck is wrong with people these days. cant they send us a bloody tweet in advance?
[23:06] It's not going down until the 15th of March.
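For scale, a back-of-the-envelope calculation on the YouTube upload-rate figures quoted above (the ~1 GB per hour of video is a rough illustrative assumption, not a measured number):

    # Why backing up YouTube in real time is out of reach at any plausible bandwidth.
    HOURS_PER_MINUTE = 100            # claimed upload rate from the discussion above
    GB_PER_HOUR = 1.0                 # assumed average size of one hour of video

    hours_per_day = HOURS_PER_MINUTE * 60 * 24        # 144,000 hours/day
    gb_per_day = hours_per_day * GB_PER_HOUR          # ~144 TB/day
    gbit_per_s = gb_per_day * 8 / 86400               # ~13 Gbit/s sustained

    print(f"{hours_per_day:,} hours/day, ~{gb_per_day/1000:.0f} TB/day, "
          f"~{gbit_per_s:.1f} Gbit/s just to keep pace")
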
[23:06] still not very long until it all gets deleted
[23:08] zootool has 404ed their tags pages already
[23:09] maybe some convincing over twitter could solve that http://zootool.com/about/
[23:11] tweet sent.
[23:11] to bastian
[23:11] he is the only one thats really active on twitter
[23:23] I got the my opera username crawl code on github
[23:23] https://github.com/MithrandirAgain/myopera-username-grab
[23:23] thoughts?
[23:28] garyrh: how is that actually writing anything to disk?
[23:29] I see
[23:29] oh, wait
[23:29] so it's not generating WARCs
[23:29] looks like the zootool api is still live
[23:29] no, it's just crawling for usernames
[23:47] right, load was a tad high
[23:49] SketchCow: making friends with SAY media? :p
[23:55] http://archiveteam.org/index.php?title=My_Opera looks bigger than I thought, not much time
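Since the username crawl above doesn't produce WARCs, here is a minimal sketch of how the follow-up page grab could be wrapped in one, assuming GNU wget 1.14 or newer (which added --warc-file); the profile-URL layout and the username list are placeholders, not taken from that script.

    import subprocess

    usernames = ["example-user"]                    # placeholder; comes from the crawl
    for name in usernames:
        url = "http://my.opera.com/%s/" % name      # assumed profile URL layout
        subprocess.run([
            "wget", url,
            "--page-requisites",                    # also fetch images/CSS the page uses
            "-e", "robots=off",
            "--warc-file", "myopera-" + name,       # writes myopera-<name>.warc.gz
        ], check=False)
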