[01:48] on those geocities archive sites.. did they ever scrape user data ? or just those neighborhood things
[04:27] Is 'accept-encoding': 'gzip, deflate' a non-suspicious header to use for my crawler?
[04:27] e.g http://pastebin.com/Fdxxs7We
[04:37] I think it is a bit weird you're using a pretty old version of Firefox on that header but gzip, deflate seems to be standard.
[04:37] that's what I'm seeing on my connection to pastebin
[04:37] Do you actually capitalise the headers though?
[04:37] As they appear to be capitalised normally
[04:38] i.e. User-Agent as opposed to user-agent
[04:40] hmm..
[04:41] My Safari seems not to capitalise them
[04:41] but Chrome does.
[04:41] Strangely, however, Chrome sends a bit more headers as well....
[04:44] wait, no, what am I saying, Chrome doesn't capitalise either.
[04:45] * brayden opens wireshark
[04:45] Firefox does
[04:46] It's strange that I cannot find a list with full real headers on google.
[04:46] I guess I'll have to catch them myself locally from the browsers I have.
[04:46] http://brayden.ur.cx/images/2013-02-01_12-46-27.png is part of it.
[04:47] Chrome seems to capitalise too
[04:47] How are you catching them?
[04:47] Wireshark with filter on HTTP
[04:48] I also have a plugin on Mozilla that shows me request headers and responses.
[04:49] I've created a small server and am printing the requests... it looks like this for Chrome: http://i.imgur.com/AsLGAZW.png
[04:51] Just did a packet capture on the server with tcpdump and it is showing what wireshark showed.
[04:51] 0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac
[04:51] albeit a bit squished
[04:51] 0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:.*/*..User-
[04:51] 0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5
[04:57] I'm also getting capitals with Wire Shark...
[04:57] what the hell...
[04:57] well there you go
[04:57] your web server is weird :P
[04:57] I suppose my server was doing some parsing?
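Note: the mismatch above (lower-case names in the small server's output, capitals in Wireshark and tcpdump) is what you would expect if the server parsed the headers before printing them. A minimal sketch in Python of a listener that does no parsing, so header names show up exactly as the client sent them. The function names, the default port 8080 and the loopback address are assumptions made for illustration; none of them come from the conversation.

```python
import socket
from typing import List


def header_names(raw_request: bytes) -> List[str]:
    """Return header names exactly as they appear on the wire (no case folding)."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")[1:]  # skip the request line ("GET / HTTP/1.1")
    return [line.split(b":", 1)[0].decode("latin-1") for line in lines if b":" in line]


def capture_one_request(port: int = 8080, host: str = "127.0.0.1") -> bytes:
    """Accept a single connection and return the raw bytes up to the end of the headers.

    Hypothetical helper: port and host are illustrative defaults.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            data = b""
            # Read until the blank line that terminates the header section.
            while b"\r\n\r\n" not in data:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            # Send a minimal reply so the client does not hang waiting.
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n")
            return data
```

To use it, call capture_one_request() in one terminal, point the crawler or a browser at http://127.0.0.1:8080/, and pass the returned bytes to header_names() to see the casing and order the client really used.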
[04:57] strange, but okay.
[04:57] Looks like it gave some JSON-like output?
[04:58] oh
[04:58] yes, that's correct
[04:58] Now, I still have the problem of finding a bunch of genuine headers.
[05:05] Well I suppose I could just capitalise them.
[05:22] brayden: can I direct my crawler to your server to test its headers, please?
[05:22] I don't have a script to return headers
[05:22] ah, okay.
[05:23] I'm a bit concerned because when I define the headers, I define them in JSON. So I'm not sure what Node.js is doing with the objects afterwards.
[05:23] fingers crossed.
[05:24] oh
[05:24] Do nc -lk 80
[05:24] where 80 is the port
[05:24] k keeps it open after the connection has been closed, i.e. the script open
[05:24] It should send headers
[05:34] brayden: nice :) They appear capitalised. Besides 'host' :/ which I haven't configured...
[05:34] nice
[05:35] I will add 'host' to my customised headers. I think by default the httpclient I'm using makes it lower case.
[05:36] and thanks. First time using netcat, actually (I know).
[05:37] I've only ever used netcat in a project like once but fortunately its syntax is pretty simple
[05:37] Since there was a bash script that, part of its functionality, would listen to connections from a master to slaves
[05:38] very handy tool.
[05:40] Great, my client overwrites the 'host' header. I think I need to fiddle with the source.
[05:41] if you have nmap installed you get ncat as well which has SSL!
[07:18] stickam shutting down
[07:47] http://blog.stickam.com/post/41909003713/stickamclosing
[08:00] yep
[08:04] we need to get started on it
[08:07] hmm, no wiki page yet
[08:10] wait, closing January 31st?
[08:11] Morning.
[08:11] It's 9:11am in East Berlin, now Berlin
[08:13] oh, it begins closing 12 minutes ago, and disappears February 28th
[08:13] Presumably the January 31st bit indicates read-only mode
[08:13] yea
[08:14] for a second there I thought that had given a whole day's notice
[08:15] It seems that all pages are replaced with the memorial note?
[08:15] Yeah, looks like they just took at all down
[08:15] *it
[08:16] ouch
[08:16] I could browse groups a few minutes ago
[08:16] there was even a live stream in progress on the front page
[08:16] Google cache has some stuff, with images still up at least
[08:17] and hundreds of people in chat rooms
[08:18] Their "random video from the Stickam archives" player doesn't seem to work at least for me: staging.stickam-player.stk doesn't resolve
[08:19] Aha, https still works!
[08:19] E.g. http://www.stickam.com/theoneringnet vs https://www.stickam.com/theoneringnet
[08:21] ooh
[08:22] no, only partially
[08:22] groups are gone
[08:22] Wow, they probably lost a lot of money really quickly.
[08:22] Someone shut off the tap
[08:22] who's Live is all empty
[08:22] maybe they're pulling data from http though
[08:22] SketchCow: yea
[08:26] We might be screwed here, which is understandable.
[08:26] the wording of the message was pretty misleading, too
[08:27] it said that the site would remain alive until the 28th
[08:27] well, I updated the wiki page
[08:27] for whatever that's worth
[08:29] Looks like the https version of the "who's online" page still works - leads to working profiles, and working group pages
[08:29] we might be able to spider the https site
[08:29] adamcaudi: yea
[08:29] I can't get any videos to load though
[08:31] heh, clicking on the Randomizer button off to the side pops up an alert saying 'There is no live user.'
[08:38] Wow
[08:43] SketchCow: Hey timezone buddy.
[08:43] I guess we have to go down the list of social networks in the wiki and just do them all now
[08:45] "The site will remain alive here until February 28, 2013." from the StickAm post.
[08:45] ersi: yea
[08:45] and technically it still is there, and if you have an account you can log in and download your videos
[08:45] Also, that was a fucking disastrous background on that blog post.. Barely readable.
[08:45] Where's the visiting me
[08:46] oops, I missed your line re 28th of february. Thought no one mentioned that
[08:46] Or are you another one of the archive team members who makes $5 a week
[08:46] so perhaps we could accidentally liberate a username/password list and just download everything ourselves
[08:48] http://www.archiveteam.org/index.php?title=Stickam
[08:49] that squarish dude with the sad eyes in their goodbye banner would make a good image for the page :P
[08:51] hrm, http://player.stickam.com/flash/stickam/stickam_player.swf still exists, sorta
[08:52] it's still a real swf file
[08:52] I just shifted over the videos, godane.
[08:52] So everything that's in g4video by mistake is where it should be
[09:02] Anybody has any tips on figuring out if I'm getting a "hello world" page instead of the actual page I wish to crawl? (i.e getting 'blacklisted' by the website)
[09:03] actually..... no, there can't be.
[09:03] even a human wouldn't be able to tell.
[09:03] By finding how their "Fuck you page" looks and then knowing how it looks and looking for it :)
[09:04] :D
[09:04] xk_id, are you sure you're past the host right?
[09:04] Most likely, you'd get firewall'd off or 404'd/500'd or something
[09:04] adamcaudi: what do you mean? the lower case header?
[09:05] xk_id, many servers return a default page if it can't find / understand what host you are asking for
[09:06] Sorry, I'm still not sure what you mean :) why would my host be illogical?
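Note: the 09:03 suggestion (learn what the site's block page or default page looks like, then look for it) can be sketched roughly as below. This is only an illustration in Python under stated assumptions: the function names are hypothetical, the 0.9 similarity threshold is an arbitrary starting point, and the sample pages have to be collected by hand from a request you know was blocked or sent with a bogus Host header.

```python
import difflib
import re
from typing import Iterable


def normalise(html: str) -> str:
    """Collapse whitespace and lower-case so trivial differences do not matter."""
    return re.sub(r"\s+", " ", html).strip().lower()


def looks_like_block_page(body: str, known_block_pages: Iterable[str],
                          threshold: float = 0.9) -> bool:
    """Return True if body is nearly identical to any known block/default page.

    threshold is an assumed cut-off on difflib's similarity ratio (0.0 to 1.0).
    """
    candidate = normalise(body)
    for sample in known_block_pages:
        ratio = difflib.SequenceMatcher(None, candidate, normalise(sample)).ratio()
        if ratio >= threshold:
            return True
    return False
```

A crawler could run this check on every response and pause or back off when it starts returning True, rather than silently storing placeholder pages.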
[09:08] as far as I can tell, my spider sends intelligible headers, and the RFC says they are not case sensitive
[09:10] I've seen case sensitive implementations - even though the RFC says it doesn't matter
[09:11] Unfortunately there's not much I can do at the moment. It has to do with the module I'm using. We've tried modifying the source code, but we're afraid of breaking something
[09:11] we're waiting for a developer to reply: https://github.com/mikeal/request/issues/426
[09:12] thanks for the heads up...
[09:14] adamcaudi: do you have some reference I could add on github?
[09:14] perhaps it will press devs to respond
[09:15] I'll let you know if I can remember which server it was - been some time, can't think of which one it is right now
[09:16] Ok
[09:17] Do you have the actual request that was sent? Curious to see if there's something else odd with it
[09:22] adamcaudi: par example http://dpaste.com/903092/
[09:22] as captured by netcat
[09:26] They're artificially created by me, btw
[09:26] well, at the upper-cased ones at least :P
[09:26] without *at
[09:36] Only thing that jumps out at me is the order is odd - host is normally the first header (so second line), but that shouldn't change anything
[09:39] hmm...
[10:26] http://xkcd.com/1168/
[10:27] #archiveteam-bs man
[10:27] I don't think so
[12:44] this sucks
[12:45] i downloaded nerds 2.0 pbs series
[12:45] looks like the video is only under 600kbps when the file is 891mb
[12:45] this is cause the audio is pcm and has a bitrate of 1411kbps
[14:35] someone fell asleep encoding that one :(
[14:45] its still very watchable
[14:45] and when its devide there will be a smaller one
[14:49] i'm uploading a blockbuster customer service tape from 2000
[14:50] i got another one also that i will upload called the different guest
[22:35] is there any public effort into archiving pastebin type sites?
[22:35] *ongoing
[22:44] Not that I'm aware of