#archiveteam 2013-02-01,Fri

Time Nickname Message
01:48 🔗 Brenry on those geocities archive sites.. did they ever scrape user data ? or just those neighborhood things
04:27 🔗 xk_id Is 'accept-encoding': 'gzip, deflate' a non-suspicious header to use for my crawler?
04:27 🔗 xk_id e.g http://pastebin.com/Fdxxs7We
04:37 🔗 brayden I think it is a bit weird you're using a pretty old version of Firefox on that header but gzip, deflate seems to be standard.
04:37 🔗 brayden that's what I'm seeing on my connection to pastebin
04:37 🔗 brayden Do you actually capitalise the headers though?
04:37 🔗 brayden As they appear to be capitalised normally
04:38 🔗 brayden i.e. User-Agent as opposed to user-agent
04:40 🔗 xk_id hmm..
04:41 🔗 xk_id My Safari seems not to capitalise them
04:41 🔗 xk_id but Chrome does.
04:41 🔗 xk_id Strangely, however, Chrome sends a bit more headers as well....
04:44 🔗 xk_id wait, no, what am I saying, Chrome doesn't capitalise either.
04:45 🔗 * brayden opens wireshark
04:45 🔗 brayden Firefox does
04:46 🔗 xk_id It's strange that I cannot find a list with full real headers on google.
04:46 🔗 xk_id I guess I'll have to catch them myself locally from the browsers I have.
04:46 🔗 brayden http://brayden.ur.cx/images/2013-02-01_12-46-27.png is part of it.
04:47 🔗 brayden Chrome seems to capitalise too
04:47 🔗 xk_id How are you catching them?
04:47 🔗 brayden Wireshark with filter on HTTP
04:48 🔗 brayden I also have a plugin on Mozilla that shows me request headers and responses.
04:49 🔗 xk_id I've created a small server and am printing the requests... it looks like this for Chrome: http://i.imgur.com/AsLGAZW.png
04:51 🔗 brayden Just did a packet capture on the server with tcpdump and it is showing what wireshark showed.
04:51 🔗 brayden 0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac
04:51 🔗 brayden albeit a bit squished
04:51 🔗 brayden 0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:.*/*..User-
04:51 🔗 brayden 0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5
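The three tcpdump rows brayden pasted can be decoded back to text to check the capitalisation. This is a minimal Python sketch, assuming the rows are reassembled in offset order (0x0060, 0x0070, 0x0080); the helper name `decode_hex_rows` is hypothetical and only for illustration.

```python
def decode_hex_rows(rows):
    """Join tcpdump-style hex columns and decode them as ASCII text."""
    # Strip the spaces between the 2-byte groups, then concatenate the rows.
    hex_digits = "".join("".join(row.split()) for row in rows)
    return bytes.fromhex(hex_digits).decode("ascii")


# Hex columns copied from the capture, in offset order (an assumption:
# the chat shows them interleaved with another message).
rows = [
    "3a20 6b65 6570 2d61 6c69 7665 0d0a 4163",
    "6365 7074 3a20 2a2f 2a0d 0a55 7365 722d",
    "4167 656e 743a 204d 6f7a 696c 6c61 2f35",
]

decoded = decode_hex_rows(rows)
print(repr(decoded))
```

The decoded fragment contains `Accept` and `User-Agent` with capital letters, which matches what brayden reports seeing on the wire.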
04:57 🔗 xk_id I'm also getting capitals with Wireshark...
04:57 🔗 xk_id what the hell...
04:57 🔗 brayden well there you go
04:57 🔗 brayden your web server is weird :P
04:57 🔗 xk_id I suppose my server was doing some parsing?
04:57 🔗 xk_id strange, but okay.
04:57 🔗 brayden Looks like it gave some JSON-like output?
04:58 🔗 xk_id oh
04:58 🔗 xk_id yes, that's correct
04:58 🔗 xk_id Now, I still have the problem of finding a bunch of genuine headers.
05:05 🔗 xk_id Well I suppose I could just capitalise them.
05:22 🔗 xk_id brayden: can I direct my crawler to your server to test its headers, please?
05:22 🔗 brayden I don't have a script to return headers
05:22 🔗 xk_id ah, okay.
05:23 🔗 xk_id I'm a bit concerned because when I define the headers, I define them in JSON. So I'm not sure what Node.js is doing with the objects afterwards.
05:23 🔗 xk_id fingers crossed.
05:24 🔗 brayden oh
05:24 🔗 brayden Do nc -lk 80
05:24 🔗 brayden where 80 is the port
05:24 🔗 brayden -k keeps it listening after the connection has been closed, i.e. it keeps the script open
05:24 🔗 brayden It should send headers
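Roughly the same check can be done without netcat. The following is a sketch only, under stated assumptions: port 8080 is used instead of 80 because it needs no special privileges, a single `recv` is assumed to hold the whole request head, and the function names `capture_one_request` and `header_names` are hypothetical.

```python
import socket


def header_names(raw_request):
    """Return header names exactly as they appeared on the wire."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    # Skip the request line ("GET / HTTP/1.1"); the rest are header lines.
    lines = head.split(b"\r\n")[1:]
    return [line.split(b":", 1)[0].decode("latin-1") for line in lines if b":" in line]


def capture_one_request(port=8080):
    """Accept one connection on localhost and return the raw bytes it sent."""
    with socket.create_server(("127.0.0.1", port)) as server:
        conn, _addr = server.accept()
        with conn:
            # Assumption: the request head fits in one read.
            return conn.recv(65536)


if __name__ == "__main__":
    raw = capture_one_request()
    print(raw.decode("latin-1"))
    print(header_names(raw))
```

Pointing the crawler at `http://127.0.0.1:8080/` while this runs prints the request unparsed, so any lower-cased name such as `host` shows up as sent.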
05:34 🔗 xk_id brayden: nice :) They appear capitalised. Besides 'host' :/ which I haven't configured...
05:34 🔗 brayden nice
05:35 🔗 xk_id I will add 'host' to my customised headers. I think by default the httpclient I'm using makes it lower case.
05:36 🔗 xk_id and thanks. First time using netcat, actually (I know).
05:37 🔗 brayden I've only ever used netcat in a project like once but fortunately its syntax is pretty simple
05:37 🔗 brayden Since there was a bash script that, as part of its functionality, would listen to connections from a master to slaves
05:38 🔗 xk_id very handy tool.
05:40 🔗 xk_id Great, my client overwrites the 'host' header. I think I need to fiddle with the source.
05:41 🔗 brayden if you have nmap installed you get ncat as well which has SSL!
07:18 🔗 lemonkey stickam shutting down
07:47 🔗 lemonkey http://blog.stickam.com/post/41909003713/stickamclosing
08:00 🔗 db48x yep
08:04 🔗 db48x we need to get started on it
08:07 🔗 db48x hmm, no wiki page yet
08:10 🔗 db48x wait, closing January 31st?
08:11 🔗 SketchCow Morning.
08:11 🔗 SketchCow It's 9:11am in East Berlin, now Berlin
08:13 🔗 db48x oh, it began closing 12 minutes ago, and disappears February 28th
08:13 🔗 Deewiant Presumably the January 31st bit indicates read-only mode
08:13 🔗 db48x yea
08:14 🔗 db48x for a second there I thought that had given a whole day's notice
08:15 🔗 Deewiant It seems that all pages are replaced with the memorial note?
08:15 🔗 adamcaudi Yeah, looks like they just took at all down
08:15 🔗 adamcaudi *it
08:16 🔗 db48x ouch
08:16 🔗 db48x I could browse groups a few minutes ago
08:16 🔗 db48x there was even a live stream in progress on the front page
08:16 🔗 Deewiant Google cache has some stuff, with images still up at least
08:17 🔗 db48x and hundreds of people in chat rooms
08:18 🔗 Deewiant Their "random video from the Stickam archives" player doesn't seem to work at least for me: staging.stickam-player.stk doesn't resolve
08:19 🔗 Deewiant Aha, https still works!
08:19 🔗 Deewiant E.g. http://www.stickam.com/theoneringnet vs https://www.stickam.com/theoneringnet
08:21 🔗 db48x ooh
08:22 🔗 db48x no, only partially
08:22 🔗 db48x groups are gone
08:22 🔗 SketchCow Wow, they probably lost a lot of money really quickly.
08:22 🔗 SketchCow Someone shut off the tap
08:22 🔗 db48x who's Live is all empty
08:22 🔗 db48x maybe they're pulling data from http though
08:22 🔗 db48x SketchCow: yea
08:26 🔗 SketchCow We might be screwed here, which is understandable.
08:26 🔗 db48x the wording of the message was pretty misleading, too
08:27 🔗 db48x it said that the site would remain alive until the 28th
08:27 🔗 db48x well, I updated the wiki page
08:27 🔗 db48x for whatever that's worth
08:29 🔗 adamcaudi Looks like the https version of the "who's online" page still works - leads to working profiles, and working group pages
08:29 🔗 db48x we might be able to spider the https site
08:29 🔗 db48x adamcaudi: yea
08:29 🔗 db48x I can't get any videos to load though
08:31 🔗 db48x heh, clicking on the Randomizer button off to the side pops up an alert saying 'There is no live user.'
08:38 🔗 SketchCow Wow
08:43 🔗 ersi SketchCow: Hey timezone buddy.
08:43 🔗 db48x I guess we have to go down the list of social networks in the wiki and just do them all now
08:45 🔗 ersi "The site will remain alive here until February 28, 2013." from the StickAm post.
08:45 🔗 db48x ersi: yea
08:45 🔗 db48x and technically it still is there, and if you have an account you can log in and download your videos
08:45 🔗 ersi Also, that was a fucking disastrous background on that blog post.. Barely readable.
08:45 🔗 SketchCow Where's the visiting me
08:46 🔗 ersi oops, I missed your line re 28th of february. Thought no one mentioned that
08:46 🔗 SketchCow Or are you another one of the archive team members who makes $5 a week
08:46 🔗 db48x so perhaps we could accidentally liberate a username/password list and just download everything ourselves
08:48 🔗 db48x http://www.archiveteam.org/index.php?title=Stickam
08:49 🔗 db48x that squarish dude with the sad eyes in their goodbye banner would make a good image for the page :P
08:51 🔗 db48x hrm, http://player.stickam.com/flash/stickam/stickam_player.swf still exists, sorta
08:52 🔗 db48x it's still a real swf file
08:52 🔗 SketchCow I just shifted over the videos, godane.
08:52 🔗 SketchCow So everything that's in g4video by mistake is where it should be
09:02 🔗 xk_id Anybody has any tips on figuring out if I'm getting a "hello world" page instead of the actual page I wish to crawl? (i.e getting 'blacklisted' by the website)
09:03 🔗 xk_id actually..... no, there can't be.
09:03 🔗 xk_id even a human wouldn't be able to tell.
09:03 🔗 ersi By finding how their "Fuck you page" looks and then knowing how it looks and looking for it :)
09:04 🔗 xk_id :D
09:04 🔗 adamcaudi xk_id, are you sure you're past the host right?
09:04 🔗 ersi Most likely, you'd get firewall'd off or 404'd/500'd or something
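ersi's two suggestions (learn what the block page looks like, and watch for error statuses) could be combined into a check like the one below. This is a hedged sketch: the status list, the `min_length` threshold, and the marker strings are all assumptions that would have to be tuned per site, and `looks_blocked` is a hypothetical name.

```python
def looks_blocked(status, body, block_markers, min_length=500):
    """Return a reason string if the response looks like a block page, else None."""
    # Assumed set of statuses worth flagging; a real crawl may see honest 404s.
    if status in (403, 404, 429, 500, 503):
        return f"suspicious status {status}"
    lowered = body.lower()
    for marker in block_markers:
        # Markers are phrases copied from a block page seen earlier.
        if marker.lower() in lowered:
            return f"matched block marker {marker!r}"
    # A very short body is a weak hint of a placeholder page.
    if len(body) < min_length:
        return f"body shorter than {min_length} characters"
    return None


if __name__ == "__main__":
    print(looks_blocked(200, "<h1>Access Denied</h1>" + "x" * 1000, ["access denied"]))
```

As xk_id points out, a page that looks plausible to a human cannot be caught this way; the check only covers the cases where the refusal is recognisable.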
09:04 🔗 xk_id adamcaudi: what do you mean? the lower case header?
09:05 🔗 adamcaudi xk_id, many servers return a default page if it can't find / understand what host you are asking for
09:06 🔗 xk_id Sorry, I'm still not sure what you mean :) why would my host be illogical?
09:08 🔗 xk_id as far as I can tell, my spider sends intelligible headers, and the RFC says they are not case sensitive
09:10 🔗 adamcaudi I've seen case sensitive implementations - even though the RFC says it doesn't matter
09:11 🔗 xk_id Unfortunately there's not much I can do at the moment. It has to do with the module I'm using. We've tried modifying the source code, but we're afraid of breaking something
09:11 🔗 xk_id we're waiting for a developer to reply: https://github.com/mikeal/request/issues/426
09:12 🔗 xk_id thanks for the heads up...
09:14 🔗 xk_id adamcaudi: do you have some reference I could add on github?
09:14 🔗 xk_id perhaps it will press devs to respond
09:15 🔗 adamcaudi I'll let you know if I can remember which server it was - been some time, can't think of which one it is right now
09:16 🔗 xk_id Ok
09:17 🔗 adamcaudi Do you have the actual request that was sent? Curious to see if there's something else odd with it
09:22 🔗 xk_id adamcaudi: par example http://dpaste.com/903092/
09:22 🔗 xk_id as captured by netcat
09:26 🔗 xk_id They're artificially created by me, btw
09:26 🔗 xk_id well, at the upper-cased ones at least :P
09:26 🔗 xk_id without *at
09:36 🔗 adamcaudi Only thing that jumps out at me is the order is odd - host is normally the first header (so second line), but that shouldn't change anything
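One way to control both the casing and the order adamcaudi mentions is to serialise the request by hand. The sketch below is an illustration under assumptions, not how the `request` module xk_id uses behaves: headers are passed as a list of pairs so their order is explicit, `Host` is always written first, and `build_request` is a hypothetical helper.

```python
def build_request(host, path, headers):
    """Serialise a GET request, writing Host first and names exactly as given."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in headers:
        if name.lower() == "host":
            continue  # Host was already written as the first header.
        lines.append(f"{name}: {value}")
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    raw = build_request(
        "example.com",  # placeholder host, not from the log
        "/",
        [("User-Agent", "Mozilla/5.0"), ("Accept-Encoding", "gzip, deflate")],
    )
    print(raw.decode("ascii"))
```

The bytes can be sent over a plain socket and compared against a capture from a real browser.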
09:39 🔗 xk_id hmm...
10:26 🔗 Nemo_bis http://xkcd.com/1168/
10:27 🔗 ersi #archiveteam-bs man
10:27 🔗 Nemo_bis I don't think so
12:44 🔗 godane this sucks
12:45 🔗 godane i downloaded nerds 2.0 pbs series
12:45 🔗 godane looks like the video is only under 600 kbps when the file is 891 MB
12:45 🔗 godane this is because the audio is PCM and has a bitrate of 1411 kbps
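The 1411 kbps figure matches uncompressed CD-style audio, and the arithmetic shows how much of the file it takes up. This is a sketch with assumptions: 44.1 kHz, 16-bit, stereo PCM is not stated in the log, the video rate is taken as the 600 kbps upper bound, 1 MB is treated as 1000 kB, and container overhead is ignored.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def duration_minutes(file_size_mb, total_bitrate_kbps):
    """Playing time implied by a file size and a combined bitrate."""
    kilobits = file_size_mb * 8 * 1000  # assumes 1 MB = 1000 kB
    return kilobits / total_bitrate_kbps / 60


audio_kbps = pcm_bitrate_kbps(44100, 16, 2)   # assumed CD-style parameters
video_kbps = 600                              # upper bound quoted in the log
audio_share = audio_kbps / (audio_kbps + video_kbps)
minutes = duration_minutes(891, audio_kbps + video_kbps)

print(f"audio: {audio_kbps:.1f} kbps, share of stream: {audio_share:.0%}")
print(f"implied running time: about {minutes:.0f} minutes")
```

Under these assumptions the audio accounts for roughly 70% of the stream, and the 891 MB file works out to about an hour of video.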
14:35 🔗 turnkit someone fell asleep encoding that one :(
14:45 🔗 godane its still very watchable
14:45 🔗 godane and when it's derived there will be a smaller one
14:49 🔗 godane i'm uploading a blockbuster customer service tape from 2000
14:50 🔗 godane i got another one also that i will upload called the different guest
22:35 🔗 S[h]O[r]T is there any public effort into archiving pastebin type sites?
22:35 🔗 S[h]O[r]T *ongoing
22:44 🔗 ersi Not that I'm aware of
