#archiveteam 2014-02-10,Mon


Time Nickname Message
00:53 🔗 Leo_TCK but tell me what have you found out
00:53 🔗 Leo_TCK oops
00:53 🔗 Leo_TCK sry wrong channel
00:53 🔗 Leo_TCK didnt mean to say that here
05:54 🔗 namespace So earlier you guys said that the only way to grab the data from Googles awful site. (Not gonna lie, I think google groups is the least usable site I've ever seen from a professional company that should know better, let alone the worst usenet client.)
05:55 🔗 namespace was to use phantomJS, would this be done through grabs of the state of the webkit client while it's on a particular comment tree or?
05:57 🔗 namespace Is there somewhere I can go to read about how previous JS heavy sites have been grabbed?
06:07 🔗 ivan` namespace: using a browser engine is probably a bad idea given that there are hundreds of millions of pages
06:09 🔗 ivan` Google Reader was JavaScript-heavy and that was grabbed via a JSON API
06:09 🔗 ivan` you'd have to figure out what the browser is requesting and write some Python to make similar requests
06:11 🔗 namespace ivan`: Actually interestingly enough, I'm reading a stackoverflow thread that suggests there may be an api for google groups.
06:11 🔗 ivan` Leo_TCK: it wasn't in any of the .zips in my ut-files grab, sorry
06:11 🔗 namespace https://stackoverflow.com/questions/3757793/api-for-google-groups
06:11 🔗 namespace Nope, nevermind it's just for getting lists of users. -_-
06:14 🔗 SketchCow Hey, maniacs.
06:14 🔗 namespace SketchCow: Hey. Stupid question, are you Jason Scott?
06:15 🔗 ivan` /whois SketchCow ;)
06:15 🔗 namespace ivan`: Oh right, I forgot IRC has that feature.
06:15 🔗 * SketchCow sings the phantom of the opera soundtrack
06:16 🔗 SketchCow iiiiiii aammmm the pahnnntoooommm of the aaarrrrrchhhiives
06:16 🔗 namespace Seriously though dude, you're awesome. You're this modern monuments man saving history from the clutches of megacorps and "dewey eyed fucksticks".
06:17 🔗 namespace Now back to groups...
06:18 🔗 namespace So part of the problem I'm having is that if you take a look at the stuff that the GG client spits back at you when you try and inspect it, it's all obfuscated or minimized garbage.
06:20 🔗 namespace (Or rather the source code is at least, the actual network requests just don't lend themselves to comprehension.)
06:20 🔗 yipdw that's actually why I suggested a browser engine :P
06:21 🔗 yipdw it eliminates the need to know that layer (which is subject to change anyway)
06:21 🔗 yipdw now you do have the scaling problem
06:21 🔗 yipdw I'd be interested to see how well the Warrior or something like it could cope
06:24 🔗 ivan` if you go with the browser engine, you're counting on 1) Google's JavaScript actually working as you want it (huge threads, no intentional blocking features added by goog); 2) warriors having hundreds of MB free for webkit/blink
06:24 🔗 ivan` someone like Kenshin can do hundreds of requests per second but probably not with a browser
06:24 🔗 namespace So I have zero experience with the web stack. (Http/HTML/JS/etc). Should I start with the http RFC or?
06:26 🔗 ivan` http://www.garshol.priv.no/download/text/http-tut.html
06:27 🔗 ivan` http://www.jmarshall.com/easy/http/
06:27 🔗 ivan` I don't think there's really that much to know; just get your URL and request headers right
06:27 🔗 ivan` wget or wpull or requests will handle the HTTP requesting for you
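ivan`'s advice above (copy the browser's URL and headers, then let a library like requests do the HTTP) can be sketched offline with `requests`' prepare step. Everything here is illustrative: the User-Agent string and the `hl` parameter are stand-ins, not values the log confirms Google Groups needs.

```python
# Sketch of replaying a browser request with the `requests` library,
# without actually hitting the network: prepare the request and inspect
# what would go over the wire. Header values are placeholders copied
# from a browser's network inspector in practice.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0",
    "Accept": "text/html,application/xhtml+xml",
})

req = requests.Request("GET", "https://groups.google.com/forum/", params={"hl": "en"})
prepared = session.prepare_request(req)

print(prepared.url)                    # full URL with query string
print(prepared.headers["User-Agent"])  # headers requests would send
```

Once the prepared request looks like what the browser sent, `session.send(prepared)` (or plain `session.get(...)`) performs the actual fetch.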
06:28 🔗 ivan` reverse-engineering google's blobs of JS is hopefully unnecessary
06:31 🔗 ivan` if you really want to learn JS stuff, https://sivers.org/learn-js suggests Professional JavaScript for Web Developers, 3rd Edition
06:33 🔗 namespace "RFC 2068 describes HTTP/1.1, which extends and improves HTTP/1.0 in a number of areas. Very few browsers support it (MSIE 4.0 is the only one known to the author), but servers are beginning to do so." This is pretty old guide. :P
06:33 🔗 aggrosk I recommend "Javascript: The Good Parts" if you have previous programming experience.
06:35 🔗 namespace ivan`: So the general problem with that approach is that I can't even tell what I'm requesting.
06:35 🔗 namespace It's not like GG has a sane directory structure.
06:36 🔗 namespace Most of the requests my browser makes supposedly have no content, etc.
06:36 🔗 namespace Or how when I scroll I always request the same thing from the server.
06:36 🔗 namespace Even though I get different content.
06:38 🔗 ivan` yeah, I just noticed that
06:38 🔗 ivan` I wonder if some weird SPDY or HTTP/TCP session stuff is going on
06:38 🔗 namespace Maybe. Probably need to pull out wireshark to investigate this one.
06:39 🔗 * namespace grumbles when google refuses to serve me anything but https
06:43 🔗 namespace Okay so apparently I can coax wireshark into decrypting SSL traffic if I have a copy of the decryption key. How do I extract the key my browser sends google for them to securely send me data?
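One answer to namespace's question, assuming an NSS-based browser like Firefox: rather than extracting keys from the handshake, have the browser log its per-session secrets to a file Wireshark can read. The file path below is an arbitrary choice.

```shell
# NSS-based browsers (Firefox, Chrome) write per-session TLS secrets to
# the file named in SSLKEYLOGFILE. Point Wireshark at the same file via
# Preferences -> Protocols -> SSL -> "(Pre)-Master-Secret log filename".
export SSLKEYLOGFILE="$HOME/sslkeys.log"
firefox &   # must be launched from this shell so it inherits the variable
```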
06:44 🔗 namespace (That is how SSL works right? *wikis*)
06:44 🔗 ivan` try scrolling around Google Groups in Firefox and you'll get much better results
06:45 🔗 ivan` it uses POST and gets a normal-looking response body
06:45 🔗 ivan` also, right click the requests scrolling by in the Console tab and check "Log Request and Response Bodies"
06:46 🔗 namespace I am using firefox.
06:47 🔗 ivan` oh, I only saw the No Content stuff in Chrome
06:47 🔗 namespace I'm seeing it in Firefox too.
06:47 🔗 ivan` I'm using a SOCKS5 proxy in Firefox, but I do have spdy and websocket enabled
06:48 🔗 ivan` let me try in a non-SOCKS5 FF
06:48 🔗 namespace Well spdy is working here, so it's not that.
06:49 🔗 namespace Websockets also work. (Both of these determined through online "test my browser" style stuff.)
06:50 🔗 ivan` in a clean Firefox 27 profile on Windows I'm seeing normal POST requests to https://groups.google.com/forum/msg and https://groups.google.com/forum/msg_bkg
06:51 🔗 ivan` requests to /csi are No Content though
06:51 🔗 namespace ivan`: I think I might just be reading the wrong things. For one thing until you mentioned it I assumed post requests were my browser sending stuff, and that nothing actually gets to me through that.
06:53 🔗 ivan` your browser gets a response for any kind of request (even if it's No Content)
06:54 🔗 ivan` also I noticed it's using x-gwt-rpc and this is the document that describes the protocol https://docs.google.com/document/d/1eG0YocsYYbNAtivkLtcaiEE5IOF5u4LUol8-LL0TIKU/preview?pli=1&sle=true#heading=h.lczgog5ezfjp
06:54 🔗 ivan` you may not need all of that information yet
06:55 🔗 ivan` the GWT source code is also available if you need to figure out how the stranger No Content-response transport works, but hopefully you can just do the POSTs that it does in Firefox
06:56 🔗 ivan` you should definitely click on those POST requests in Firefox's inspector and look at the response body at the end of the popup window
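Replaying one of those `/forum/msg` POSTs could start from a sketch like this. The endpoint is the one ivan` saw in Firefox; the form field name and body are placeholders, since the real GWT-RPC payload has to be copied out of the request body shown in the browser's network tab.

```python
# Hypothetical sketch of rebuilding a POST seen in the inspector.
# "payload" is a placeholder field name, not the real GWT-RPC body.
import requests

req = requests.Request(
    "POST",
    "https://groups.google.com/forum/msg",        # endpoint seen in Firefox
    data={"payload": "COPIED-FROM-INSPECTOR"},    # placeholder body
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
prepared = req.prepare()

# Inspect the encoded body before sending it anywhere.
print(prepared.body)
```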
06:57 🔗 namespace Okay, I'll see how far I can go with the help given so far, read the rest of that HTTP tutorial, etc.
12:15 🔗 Nemo_bis GWT is GlamWikiToolset for me :)
14:37 🔗 joepie91 urgent dump of bitcoin.it wiki needed, but no time to run wikidump myself right now
14:38 🔗 joepie91 it's owned by the owner of mt. gox
14:38 🔗 joepie91 and it looks like shit might be going down very soomn
14:38 🔗 joepie91 soon *
14:38 🔗 Dud1 Is it a big wiki?
14:41 🔗 Nemo_bis Dud1: no, minuscule: wanna do? https://it.bitcoin.it/wiki/Speciale:Statistiche
14:42 🔗 Dud1 Have it started already ;)
14:42 🔗 Nemo_bis Ah beware there are multiple, still smallish https://en.bitcoin.it/wiki/Special:Statistics
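For wikis like these, the usual WikiTeam approach is dumpgenerator.py against the wiki's API endpoint. The invocation below is a sketch with commonly used flags; check `--help` on your copy of the script.

```shell
# Dump a MediaWiki (full page histories plus images) with WikiTeam's
# dumpgenerator.py, pointed at the wiki's api.php.
python dumpgenerator.py --api=https://en.bitcoin.it/w/api.php --xml --images
```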
14:46 🔗 midas mediawiki should have a warc handle. "archive this file and done!"
14:52 🔗 Nemo_bis midas: I'm not sure what benefits it would have compared to HTML export, but you could file that bug :) they're working on HTML export now
14:53 🔗 Nemo_bis midas: I think this would be best bet https://bugzilla.wikimedia.org/enter_bug.cgi?product=openZIM&component=zimwriter
14:53 🔗 midas oh in that case i didnt say anything, if the html export is a public function :p
14:53 🔗 Nemo_bis There are two separate public HTML export features, but they're very ugly
14:54 🔗 Nemo_bis The kiwix.org guy is working on making them actually work but I'm not sure those endpoints will be open and they won't be in core MediaWiki anyway.
14:59 🔗 Nemo_bis Hence why I think a feature request in that component will be useful. :)
15:06 🔗 midas yeah, would be cool
15:06 🔗 midas specialpage:archiveteam
15:06 🔗 midas :p
15:06 🔗 DFJustin http://www.egreetings.com/goodbye
15:07 🔗 midas ... what is it in february.
15:08 🔗 namespace 10th
15:08 🔗 namespace Basically if we want this one we've got two days.
15:09 🔗 midas i think this could be grabbed using archivebot, but yipdw could confirm or debunk that
15:09 🔗 DFJustin most of it probably
15:18 🔗 namespace Good luck guys.
16:04 🔗 Nemo_bis How do I get Wayback to archive such a thing in a meaningful way? https://toolserver.org/~nemobis/crstats/robla/
17:23 🔗 yipdw midas: you might as well
17:23 🔗 yipdw if it runs away we just abort
17:32 🔗 midas i think DFJustin is grabbing it already using the bot :)
17:42 🔗 Dud2 On bitcoin.it it seems to be stuck on one particular page (for 15+ minutes)
17:44 🔗 Nemo_bis is the ram usage increasing?
17:44 🔗 Nemo_bis or do you see network activity
18:59 🔗 Nemo_bis deathy: do you still have that huge server at your disposal? :)
20:03 🔗 beardicus is there a dogster/catster channel?
20:15 🔗 yipdw beardicus: #rawdogster
20:15 🔗 beardicus thanks yipdw.
20:17 🔗 Leo_TCK what is a dogster
20:17 🔗 SketchCow A pet social media site.
20:17 🔗 SketchCow Lasted 10 years, is being shut down and turned into some piece of crap by the new owners.
20:18 🔗 SketchCow Millions of pet profiles and messages to be deleted.
20:18 🔗 Leo_TCK oh
20:18 🔗 arkiver and there is catster, that one is for the cats
20:18 🔗 SketchCow http://www.dogster.com/
20:18 🔗 arkiver http://www.catster.com/
20:39 🔗 Dud1 Nemo_bis: Sorry yeah there is network activity, it just seems to be a fairly big title
20:41 🔗 Nemo_bis Dud1: when a page is big and the site slow, it can take a while to download the whole history; the script is a bit stupid about that
20:42 🔗 Atluxity I just sent a mail about archiveteam to the norwegian computer-history mailinglist. Might be someone who cares to join with a warrior there
20:43 🔗 Dud1 It just finished it, there were 1851 edits to it. It doubled the size of the file.
20:45 🔗 arkiver SketchCow: How much of youtube or how many percent of youtube has the IA saved now?
20:57 🔗 DFJustin so I come across wikis from time to time which archivebot is not ideal for, what's the best way to nominate them for wikiteam
20:58 🔗 Nemo_bis DFJustin: downloading them yourself :P
20:58 🔗 Nemo_bis DFJustin: or at least add them to https://wikiapiary.com/
20:58 🔗 Nemo_bis There's a nice form for that, you only need to (register and) enter name and API URL
21:14 🔗 SketchCow arkiver: Well, that may be the stupidest question asked on here since we banned that french kid
21:14 🔗 SketchCow Thinking......
21:14 🔗 SketchCow Yep.
21:17 🔗 Jonimus Does the IA even have enough bandwidth to download youtube as fast as videos are being uploaded? Trying to back up youtube would be nigh impossible.
21:17 🔗 arkiver SketchCow: haha, so what stupid question had the french kid asked?
21:17 🔗 RedType The answer is: all of youtube. A mirror of youtube can be found at www.youtube.co.uk/
21:18 🔗 Nemo_bis Last week's income is quite good, must be FTP sites? :) http://teamarchive0.fnf.archive.org:8088/mrtg/networkv2.html
21:20 🔗 SketchCow The french kid asked something along the lines of how could we not use his Windows/DOS filename-structured mess of Geocities saves.
21:20 🔗 SketchCow Youtube adds 48 hours of video every minute.
21:20 🔗 SketchCow Next stupid question.
21:20 🔗 arkiver hmm
21:20 🔗 SketchCow That number may be old.
21:20 🔗 arkiver thinking already
21:20 🔗 arkiver ;)
21:21 🔗 RedType SketchCow: they claim its >100 hours per minute now
21:22 🔗 Dud1 And probably about 99.9 hours of that is total crap
21:26 🔗 SketchCow https://twitter.com/johnv/status/432988213800497152
21:26 🔗 SketchCow This guy is not sending me a birthday card
21:28 🔗 RedType SketchCow:
21:28 🔗 RedType Archive Team: better me than your family
21:29 🔗 arkiver John Vars:
21:29 🔗 arkiver Milestone: My first twitter fight!
21:29 🔗 arkiver :P
21:33 🔗 yipdw "better me than your family" hahaha
21:44 🔗 joepie91 SketchCow:
21:44 🔗 joepie91 .tw https://twitter.com/johnv/status/432989058042560512
21:45 🔗 joepie91 oh
21:45 🔗 joepie91 no botpie here
21:50 🔗 SketchCow Dogster editor in chief started following me.
21:51 🔗 SketchCow 651 followers - now there's someone who "got" social media
21:52 🔗 joepie91 haha
22:11 🔗 deathy Nemo_bis: just noticed.. the 40-ish GB of RAM one, yeah
22:12 🔗 Nemo_bis deathy: nice :) how many cores and how much disk?
22:13 🔗 ersi Dogster, bwahaha
22:14 🔗 deathy Nemo_bis: 1.7 TB free disk, 4 'real' cores ( i7-920 )
22:14 🔗 Nemo_bis SketchCow: he feels so insulted that he wants to be sure he doesn't miss any insult you make him without mentioning him? :)
22:14 🔗 Nemo_bis deathy: wonderful! Do you feel like repackaging the geocities torrent? :D
22:15 🔗 SmileyG Nemo_bis: lol how big??
22:16 🔗 Nemo_bis There are some scripts to download and clean up, then a mere 7z a -t7z -m0=BZip2 -mmt=8 -mx9 should do the job (or 4 if you don't use hyperthreading)
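The compression line above, spelled out with placeholder archive and input names (both are stand-ins, not paths from the log):

```shell
# bzip2-compressed 7z archive, 8 threads, maximum compression;
# match -mmt to your physical core count (4 without hyperthreading).
7z a -t7z -m0=BZip2 -mmt=8 -mx9 geocities-repack.7z geocities-cleaned/
```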
22:16 🔗 Nemo_bis SmileyG: it's not even 1 TB
22:16 🔗 Nemo_bis I'd really like to have a geocities archive I can happily bzgrep ! I NEED IT
22:17 🔗 SketchCow Dragan is working on that!
22:17 🔗 SketchCow he's been cleaning up the geocities mess for years now.
22:17 🔗 deathy Nemo_bis: I can, but kind of stressed with some things until Wednesday afternoon at least. If still needed remind me after that
22:19 🔗 Nemo_bis SketchCow: I know, but is he still working on publishing another "patched version"? I doubt he'd mind being helped :)
22:19 🔗 Nemo_bis deathy: it's not urgent at all :) the scripts are at https://github.com/despens/Geocities , found thanks to SketchCow's pointers.
22:20 🔗 SketchCow He's moving to the US, he'll have more time.
22:21 🔗 Nemo_bis Oh. Maybe he still likes some testers?
22:21 🔗 Nemo_bis The first 8 steps don't require a database. https://github.com/despens/Geocities/tree/master/scripts/geocities.archiveteam.torrent
22:24 🔗 deathy Nemo_bis: sent myself a reminder mail, will check back with you on Wed/Thu :)
22:26 🔗 Nemo_bis deathy: great, if you get to it let's check the details better, especially the exact compression command
22:45 🔗 RedType 14:43 <@tantek> another silo death coming up: http://zootool.com/goodbye
23:02 🔗 joepie91 RedType: refreshingly honest shutdown message
23:03 🔗 RedType "Zootool made us realize that the general idea of running a central service is nothing we believe in any longer. Your data should belong to you and shouldn't be stored on our servers. You shouldn't have to rely on us or on any other service to keep your data secure and online."
23:03 🔗 RedType " On March, 15th all data and images will be deleted forever."
23:03 🔗 midas impressive.
23:04 🔗 BlueMax "hi, we shouldn't be holding your car in this storage locker, so in a couple of weeks we're going to blow your car the fuck up"
23:05 🔗 midas what the fuck is wrong with people these days. cant they send us a bloody tweet in advance?
23:06 🔗 Dud1 It's not going down until the 15th of March.
23:06 🔗 BlueMax still not very long until it all gets deleted
23:08 🔗 ivan` zootool has 404ed their tags pages already
23:09 🔗 ivan` maybe some convincing over twitter could solve that http://zootool.com/about/
23:11 🔗 midas tweet sent.
23:11 🔗 midas to bastian
23:11 🔗 midas he is the only one thats really active on twitter
23:23 🔗 garyrh I got the my opera username crawl code on github
23:23 🔗 garyrh https://github.com/MithrandirAgain/myopera-username-grab
23:23 🔗 garyrh thoughts?
23:28 🔗 yipdw garyrh: how is that actually writing anything to disk?
23:29 🔗 yipdw I see
23:29 🔗 yipdw oh, wait
23:29 🔗 yipdw so it's not generating WARCs
23:29 🔗 RedType looks like the zootool api is still live
23:29 🔗 garyrh no, it's just crawling for usernames
23:47 🔗 midas right, load was a tad high
23:49 🔗 midas SketchCow: making friends with SAY media? :p
23:55 🔗 Nemo_bis http://archiveteam.org/index.php?title=My_Opera looks bigger than I thought, not much time
