#archiveteam 2013-02-01,Fri

Time	Nickname	Message
01:48 ^🔗	Brenry	on those geocities archive sites.. did they ever scrape user data ? or just those neighborhood things
04:27 ^🔗	xk_id	Is 'accept-encoding': 'gzip, deflate' a non-suspicious header to use for my crawler?
04:27 ^🔗	xk_id	e.g http://pastebin.com/Fdxxs7We
04:37 ^🔗	brayden	I think it is a bit weird you're using a pretty old version of Firefox on that header but gzip, deflate seems to be standard.
04:37 ^🔗	brayden	that's what I'm seeing on my connection to pastebin
04:37 ^🔗	brayden	Do you actually capitalise the headers though?
04:37 ^🔗	brayden	As they appear to be capitalised normally
04:38 ^🔗	brayden	i.e. User-Agent as opposed to user-agent
04:40 ^🔗	xk_id	hmm..
04:41 ^🔗	xk_id	My Safari seems not to capitalise them
04:41 ^🔗	xk_id	but Chrome does.
04:41 ^🔗	xk_id	Strangely, however, Chrome sends a bit more headers as well....
04:44 ^🔗	xk_id	wait, no, what am I saying, Chrome doesn't capitalise either.
04:45 ^🔗	*	brayden opens wireshark
04:45 ^🔗	brayden	Firefox does
04:46 ^🔗	xk_id	It's strange that I cannot find a list with full real headers on google.
04:46 ^🔗	xk_id	I guess I'll have to catch them myself locally from the browsers I have.
04:46 ^🔗	brayden	http://brayden.ur.cx/images/2013-02-01_12-46-27.png is part of it.
04:47 ^🔗	brayden	Chrome seems to capitalise too
04:47 ^🔗	xk_id	How are you catching them?
04:47 ^🔗	brayden	Wireshark with filter on HTTP
04:48 ^🔗	brayden	I also have a plugin on Mozilla that shows me request headers and responses.
04:49 ^🔗	xk_id	I've created a small server and am printing the requests... it looks like this for Chrome: http://i.imgur.com/AsLGAZW.png
04:51 ^🔗	brayden	Just did a packet capture on the server with tcpdump and it is showing what wireshark showed.
04:51 ^🔗	brayden	0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac
04:51 ^🔗	brayden	albeit a bit squished
04:51 ^🔗	brayden	0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:./..User-
04:51 ^🔗	brayden	0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5
04:57 ^🔗	xk_id	I'm also getting capitals with Wireshark...
04:57 ^🔗	xk_id	what the hell...
04:57 ^🔗	brayden	well there you go
04:57 ^🔗	brayden	your web server is weird :P
04:57 ^🔗	xk_id	I suppose my server was doing some parsing?
04:57 ^🔗	xk_id	strange, but okay.
04:57 ^🔗	brayden	Looks like it gave some JSON-like output?
04:58 ^🔗	xk_id	oh
04:58 ^🔗	xk_id	yes, that's correct
04:58 ^🔗	xk_id	Now, I still have the problem of finding a bunch of genuine headers.
05:05 ^🔗	xk_id	Well I suppose I could just capitalise them.
05:22 ^🔗	xk_id	brayden: can I direct my crawler to your server to test its headers, please?
05:22 ^🔗	brayden	I don't have a script to return headers
05:22 ^🔗	xk_id	ah, okay.
05:23 ^🔗	xk_id	I'm a bit concerned because when I define the headers, I define them in JSON. So I'm not sure what Node.js is doing with the objects afterwards.
05:23 ^🔗	xk_id	fingers crossed.
05:24 ^🔗	brayden	oh
05:24 ^🔗	brayden	Do nc -lk 80
05:24 ^🔗	brayden	where 80 is the port
05:24 ^🔗	brayden	-k keeps nc listening after a connection has been closed, i.e. it keeps the script open
05:24 ^🔗	brayden	It should send headers
05:34 ^🔗	xk_id	brayden: nice :) They appear capitalised. Besides 'host' :/ which I haven't configured...
05:34 ^🔗	brayden	nice
05:35 ^🔗	xk_id	I will add 'host' to my customised headers. I think by default the httpclient I'm using makes it lower case.
05:36 ^🔗	xk_id	and thanks. First time using netcat, actually (I know).
05:37 ^🔗	brayden	I've only ever used netcat in a project like once but fortunately its syntax is pretty simple
05:37 ^🔗	brayden	Since there was a bash script that, as part of its functionality, would listen for connections from a master to slaves
05:38 ^🔗	xk_id	very handy tool.
05:40 ^🔗	xk_id	Great, my client overwrites the 'host' header. I think I need to fiddle with the source.
05:41 ^🔗	brayden	if you have nmap installed you get ncat as well which has SSL!
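
[Editor's sketch] The `nc -lk` trick above works because netcat shows the request bytes exactly as the client sent them, with no parsing in between. Roughly the same thing can be done in Python with a plain socket. This is only a minimal sketch, not what anyone in the channel ran: the function names, the loopback host, and the empty 200 response are assumptions made for illustration.

```python
import socket


def read_raw_request(conn: socket.socket, limit: int = 65536) -> bytes:
    """Read bytes until the blank line that ends the HTTP header block."""
    data = b""
    while b"\r\n\r\n" not in data and len(data) < limit:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed the connection early
            break
        data += chunk
    return data


def header_names(raw: bytes) -> list:
    """Return header names exactly as sent, in order, without changing case."""
    head = raw.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")[1:]  # skip the request line
    return [ln.split(b":", 1)[0].decode("latin-1") for ln in lines if b":" in ln]


def listen_once(port: int, host: str = "127.0.0.1") -> bytes:
    """Accept a single connection, capture its raw request, reply with an empty 200."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            raw = read_raw_request(conn)
            conn.sendall(
                b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
            )
    return raw
```

Calling `print(header_names(listen_once(8080)))` and then pointing the crawler at port 8080 (an arbitrary choice) would list the header names with whatever capitalisation the client actually put on the wire.
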
07:18 ^🔗	lemonkey	stickam shutting down
07:47 ^🔗	lemonkey	http://blog.stickam.com/post/41909003713/stickamclosing
08:00 ^🔗	db48x	yep
08:04 ^🔗	db48x	we need to get started on it
08:07 ^🔗	db48x	hmm, no wiki page yet
08:10 ^🔗	db48x	wait, closing January 31st?
08:11 ^🔗	SketchCow	Morning.
08:11 ^🔗	SketchCow	It's 9:11am in East Berlin, now Berlin
08:13 ^🔗	db48x	oh, it began closing 12 minutes ago, and disappears February 28th
08:13 ^🔗	Deewiant	Presumably the January 31st bit indicates read-only mode
08:13 ^🔗	db48x	yea
08:14 ^🔗	db48x	for a second there I thought they had given a whole day's notice
08:15 ^🔗	Deewiant	It seems that all pages are replaced with the memorial note?
08:15 ^🔗	adamcaudi	Yeah, looks like they just took at all down
08:15 ^🔗	adamcaudi	*it
08:16 ^🔗	db48x	ouch
08:16 ^🔗	db48x	I could browse groups a few minutes ago
08:16 ^🔗	db48x	there was even a live stream in progress on the front page
08:16 ^🔗	Deewiant	Google cache has some stuff, with images still up at least
08:17 ^🔗	db48x	and hundreds of people in chat rooms
08:18 ^🔗	Deewiant	Their "random video from the Stickam archives" player doesn't seem to work at least for me: staging.stickam-player.stk doesn't resolve
08:19 ^🔗	Deewiant	Aha, https still works!
08:19 ^🔗	Deewiant	E.g. http://www.stickam.com/theoneringnet vs https://www.stickam.com/theoneringnet
08:21 ^🔗	db48x	ooh
08:22 ^🔗	db48x	no, only partially
08:22 ^🔗	db48x	groups are gone
08:22 ^🔗	SketchCow	Wow, they probably lost a lot of money really quickly.
08:22 ^🔗	SketchCow	Someone shut off the tap
08:22 ^🔗	db48x	who's Live is all empty
08:22 ^🔗	db48x	maybe they're pulling data from http though
08:22 ^🔗	db48x	SketchCow: yea
08:26 ^🔗	SketchCow	We might be screwed here, which is understandable.
08:26 ^🔗	db48x	the wording of the message was pretty misleading, too
08:27 ^🔗	db48x	it said that the site would remain alive until the 28th
08:27 ^🔗	db48x	well, I updated the wiki page
08:27 ^🔗	db48x	for whatever that's worth
08:29 ^🔗	adamcaudi	Looks like the https version of the "who's online" page still works - leads to working profiles, and working group pages
08:29 ^🔗	db48x	we might be able to spider the https site
08:29 ^🔗	db48x	adamcaudi: yea
08:29 ^🔗	db48x	I can't get any videos to load though
08:31 ^🔗	db48x	heh, clicking on the Randomizer button off to the side pops up an alert saying 'There is no live user.'
08:38 ^🔗	SketchCow	Wow
08:43 ^🔗	ersi	SketchCow: Hey timezone buddy.
08:43 ^🔗	db48x	I guess we have to go down the list of social networks in the wiki and just do them all now
08:45 ^🔗	ersi	"The site will remain alive here until February 28, 2013." from the StickAm post.
08:45 ^🔗	db48x	ersi: yea
08:45 ^🔗	db48x	and technically it still is there, and if you have an account you can log in and download your videos
08:45 ^🔗	ersi	Also, that was a fucking disastrous background on that blog post.. Barely readable.
08:45 ^🔗	SketchCow	Where's the visiting me
08:46 ^🔗	ersi	oops, I missed your line re 28th of february. Thought no one mentioned that
08:46 ^🔗	SketchCow	Or are you another one of the archive team members who makes $5 a week
08:46 ^🔗	db48x	so perhaps we could accidentally liberate a username/password list and just download everything ourselves
08:48 ^🔗	db48x	http://www.archiveteam.org/index.php?title=Stickam
08:49 ^🔗	db48x	that squarish dude with the sad eyes in their goodbye banner would make a good image for the page :P
08:51 ^🔗	db48x	hrm, http://player.stickam.com/flash/stickam/stickam_player.swf still exists, sorta
08:52 ^🔗	db48x	it's still a real swf file
08:52 ^🔗	SketchCow	I just shifted over the videos, godane.
08:52 ^🔗	SketchCow	So everything that's in g4video by mistake is where it should be
09:02 ^🔗	xk_id	Does anybody have any tips on figuring out if I'm getting a "hello world" page instead of the actual page I wish to crawl? (i.e. getting 'blacklisted' by the website)
09:03 ^🔗	xk_id	actually..... no, there can't be.
09:03 ^🔗	xk_id	even a human wouldn't be able to tell.
09:03 ^🔗	ersi	By finding how their "Fuck you page" looks and then knowing how it looks and looking for it :)
09:04 ^🔗	xk_id	:D
09:04 ^🔗	adamcaudi	xk_id, are you sure you're passing the host right?
09:04 ^🔗	ersi	Most likely, you'd get firewall'd off or 404'd/500'd or something
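
[Editor's sketch] ersi's suggestion amounts to fingerprinting the block page once and then checking each response against it. A minimal sketch of that idea follows. The function name, the status codes treated as suspicious, and the marker strings are all assumptions for illustration; real markers would have to be collected from the target site's actual block page.

```python
def looks_blocked(status: int, body: str, markers: list) -> bool:
    """Guess whether a response is a block/placeholder page rather than real content.

    `markers` are substrings previously observed on the site's block page
    (hypothetical values here; they must be collected by hand).
    """
    # Status codes ersi mentions (404, 500) plus common refusal codes (assumed).
    if status in (403, 404, 429) or status >= 500:
        return True
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in markers)
```

A crawler could call `looks_blocked(response_status, response_text, ["hello world"])` after each fetch and back off when it returns True. A 404 can also be a perfectly legitimate answer, so the status check is deliberately crude.
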
09:04 ^🔗	xk_id	adamcaudi: what do you mean? the lower case header?
09:05 ^🔗	adamcaudi	xk_id, many servers return a default page if it can't find / understand what host you are asking for
09:06 ^🔗	xk_id	Sorry, I'm still not sure what you mean :) why would my host be illogical?
09:08 ^🔗	xk_id	as far as I can tell, my spider sends intelligible headers, and the RFC says they are not case sensitive
09:10 ^🔗	adamcaudi	I've seen case sensitive implementations - even though the RFC says it doesn't matter
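
[Editor's sketch] The HTTP specification does define header field names as case-insensitive, so a conforming receiver compares names without regard to case. A minimal sketch of such a lookup is below. The function name and the list-of-pairs representation are assumptions for illustration, and a server that compares names byte for byte (as adamcaudi describes) would not behave this way.

```python
from typing import List, Optional, Tuple


def get_header(headers: List[Tuple[str, str]], name: str) -> Optional[str]:
    """Return the first value whose field name matches `name`, ignoring case."""
    wanted = name.lower()
    for key, value in headers:
        if key.lower() == wanted:
            return value
    return None
```

With this lookup, `host: example.org` and `Host: example.org` are treated the same, which is what the RFC asks for. Sending the conventional capitalisation is still the safer choice for a crawler, since it also satisfies the strict implementations.
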
09:11 ^🔗	xk_id	Unfortunately there's not much I can do at the moment. It has to do with the module I'm using. We've tried modifying the source code, but we're afraid of breaking something
09:11 ^🔗	xk_id	we're waiting for a developer to reply: https://github.com/mikeal/request/issues/426
09:12 ^🔗	xk_id	thanks for the heads up...
09:14 ^🔗	xk_id	adamcaudi: do you have some reference I could add on github?
09:14 ^🔗	xk_id	perhaps it will press devs to respond
09:15 ^🔗	adamcaudi	I'll let you know if I can remember which server it was - been some time, can't think of which one it is right now
09:16 ^🔗	xk_id	Ok
09:17 ^🔗	adamcaudi	Do you have the actual request that was sent? Curious to see if there's something else odd with it
09:22 ^🔗	xk_id	adamcaudi: par example http://dpaste.com/903092/
09:22 ^🔗	xk_id	as captured by netcat
09:26 ^🔗	xk_id	They're artificially created by me, btw
09:26 ^🔗	xk_id	well, at the upper-cased ones at least :P
09:26 ^🔗	xk_id	without *at
09:36 ^🔗	adamcaudi	Only thing that jumps out at me is the order is odd - host is normally the first header (so second line), but that shouldn't change anything
09:39 ^🔗	xk_id	hmm...
10:26 ^🔗	Nemo_bis	http://xkcd.com/1168/
10:27 ^🔗	ersi	#archiveteam-bs man
10:27 ^🔗	Nemo_bis	I don't think so
12:44 ^🔗	godane	this sucks
12:45 ^🔗	godane	i downloaded nerds 2.0 pbs series
12:45 ^🔗	godane	looks like the video is only under 600 kbps when the file is 891 MB
12:45 ^🔗	godane	this is because the audio is PCM and has a bitrate of 1411 kbps
14:35 ^🔗	turnkit	someone fell asleep encoding that one :(
14:45 ^🔗	godane	it's still very watchable
14:45 ^🔗	godane	and when it's derived there will be a smaller one
14:49 ^🔗	godane	i'm uploading a blockbuster customer service tape from 2000
14:50 ^🔗	godane	i got another one also that i will upload called the different guest
22:35 ^🔗	S[h]O[r]T	is there any public effort into archiving pastebin type sites?
22:35 ^🔗	S[h]O[r]T	*ongoing
22:44 ^🔗	ersi	Not that I'm aware of
