Time |
Nickname |
Message |
01:48
🔗
|
Brenry |
on those geocities archive sites.. did they ever scrape user data ? or just those neighborhood things |
04:27
🔗
|
xk_id |
Is 'accept-encoding': 'gzip, deflate' a non-suspicious header to use for my crawler? |
04:27
🔗
|
xk_id |
e.g http://pastebin.com/Fdxxs7We |
04:37
🔗
|
brayden |
I think it is a bit weird you're using a pretty old version of Firefox on that header but gzip, deflate seems to be standard. |
04:37
🔗
|
brayden |
that's what I'm seeing on my connection to pastebin |
04:37
🔗
|
brayden |
Do you actually capitalise the headers though? |
04:37
🔗
|
brayden |
As they appear to be capitalised normally |
04:38
🔗
|
brayden |
i.e. User-Agent as opposed to user-agent |
04:40
🔗
|
xk_id |
hmm.. |
04:41
🔗
|
xk_id |
My Safari seems not to capitalise them |
04:41
🔗
|
xk_id |
but Chrome does. |
04:41
🔗
|
xk_id |
Strangely, however, Chrome sends a bit more headers as well.... |
04:44
🔗
|
xk_id |
wait, no, what am I saying, Chrome doesn't capitalise either. |
04:45
🔗
|
* |
brayden opens wireshark |
04:45
🔗
|
brayden |
Firefox does |
04:46
🔗
|
xk_id |
It's strange that I cannot find a list with full real headers on google. |
04:46
🔗
|
xk_id |
I guess I'll have to catch them myself locally from the browsers I have. |
04:46
🔗
|
brayden |
http://brayden.ur.cx/images/2013-02-01_12-46-27.png is part of it. |
04:47
🔗
|
brayden |
Chrome seems to capitalise too |
04:47
🔗
|
xk_id |
How are you catching them? |
04:47
🔗
|
brayden |
Wireshark with filter on HTTP |
04:48
🔗
|
brayden |
I also have a plugin on Mozilla that shows me request headers and responses. |
04:49
🔗
|
xk_id |
I've created a small server and am printing the requests... it looks like this for Chrome: http://i.imgur.com/AsLGAZW.png |
04:51
🔗
|
brayden |
Just did a packet capture on the server with tcpdump and it is showing what wireshark showed. |
04:51
🔗
|
brayden |
0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac |
04:51
🔗
|
brayden |
albeit a bit squished |
04:51
🔗
|
brayden |
0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:.*/*..User- |
04:51
🔗
|
brayden |
0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5 |
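The three tcpdump lines quoted above can be decoded by hand to confirm the capitalisation on the wire. A minimal sketch, assuming only the hex columns are kept (offsets and the ASCII gutter dropped); `hex_rows` and `decode_rows` are names made up for this illustration:

```python
# Hex columns copied from the three tcpdump lines in the log
# (offsets 0x0060, 0x0070, 0x0080), without the ASCII gutter.
hex_rows = [
    "3a20 6b65 6570 2d61 6c69 7665 0d0a 4163",
    "6365 7074 3a20 2a2f 2a0d 0a55 7365 722d",
    "4167 656e 743a 204d 6f7a 696c 6c61 2f35",
]


def decode_rows(rows):
    """Join tcpdump-style hex groups and decode the bytes as ASCII."""
    raw = bytes.fromhex("".join(rows).replace(" ", ""))
    return raw.decode("ascii")


decoded = decode_rows(hex_rows)

# Header lines are separated by CRLF (0d0a in the dump).
for line in decoded.split("\r\n"):
    print(line)
```

The decoded fragment contains `Accept` and `User-Agent` with capitals, which matches what brayden reports seeing in Wireshark.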
04:57
🔗
|
xk_id |
I'm also getting capitals with Wireshark... |
04:57
🔗
|
xk_id |
what the hell... |
04:57
🔗
|
brayden |
well there you go |
04:57
🔗
|
brayden |
your web server is weird :P |
04:57
🔗
|
xk_id |
I suppose my server was doing some parsing? |
04:57
🔗
|
xk_id |
strange, but okay. |
04:57
🔗
|
brayden |
Looks like it gave some JSON-like output? |
04:58
🔗
|
xk_id |
oh |
04:58
🔗
|
xk_id |
yes, that's correct |
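One plausible reading of the exchange above: a server that parses requests before printing them can hide the original capitalisation. Node's `http` module, for instance, exposes header names in lower case on the request object. What xk_id's server actually did is not shown in the log, so the sketch below only illustrates the general effect; `parse_headers` and the sample request are hypothetical:

```python
def parse_headers(raw_request: bytes):
    """Return (raw_pairs, normalised) for the header section of an HTTP/1.x request.

    raw_pairs keeps names exactly as sent; normalised lower-cases them,
    the way many server-side parsers do.
    """
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")[1:]  # skip the request line
    raw_pairs = []
    for line in lines:
        name, _, value = line.partition(":")
        raw_pairs.append((name, value.strip()))
    normalised = {name.lower(): value for name, value in raw_pairs}
    return raw_pairs, normalised


# Hypothetical request; example.org is a placeholder host.
sample = (
    b"GET / HTTP/1.1\r\n"
    b"Host: example.org\r\n"
    b"User-Agent: Mozilla/5.0\r\n"
    b"Accept-Encoding: gzip, deflate\r\n"
    b"\r\n"
)

raw_pairs, normalised = parse_headers(sample)
print([name for name, _ in raw_pairs])  # names as sent on the wire
print(list(normalised))                 # names as a normalising parser reports them
```

Printing the normalised mapping would make every browser look as if it sent lower-case names, which is why a packet capture is the more trustworthy view.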
04:58
🔗
|
xk_id |
Now, I still have the problem of finding a bunch of genuine headers. |
05:05
🔗
|
xk_id |
Well I suppose I could just capitalise them. |
05:22
🔗
|
xk_id |
brayden: can I direct my crawler to your server to test its headers, please? |
05:22
🔗
|
brayden |
I don't have a script to return headers |
05:22
🔗
|
xk_id |
ah, okay. |
05:23
🔗
|
xk_id |
I'm a bit concerned because when I define the headers, I define them in JSON. So I'm not sure what Node.js is doing with the objects afterwards. |
05:23
🔗
|
xk_id |
fingers crossed. |
05:24
🔗
|
brayden |
oh |
05:24
🔗
|
brayden |
Do nc -lk 80 |
05:24
🔗
|
brayden |
where 80 is the port |
05:24
🔗
|
brayden |
-k keeps it listening after the connection has been closed, i.e. it keeps the script open |
05:24
🔗
|
brayden |
It should send headers |
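Where netcat is not available, the same trick (listen on a port and dump whatever the client sends, unparsed) can be sketched in a few lines. This is an illustration only, not a replacement for `nc`: it handles a single connection, and the function names, the loopback address, and port 8080 (chosen because port 80 usually needs root) are assumptions:

```python
import socket


def read_request_head(conn, timeout=5.0):
    """Read from a connected socket until the blank line that ends the headers."""
    conn.settimeout(timeout)
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:  # client closed the connection early
            break
        data += chunk
    return data


def capture_one_request(host="127.0.0.1", port=8080):
    """Listen on host:port, accept one client, and return its raw request head."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            return read_request_head(conn)


# Usage (blocks until a client connects): point the crawler at
# http://127.0.0.1:8080/ and run
#   print(capture_one_request().decode("iso-8859-1"))
```

Because nothing parses the bytes, header names come out exactly as the client wrote them, including any header the HTTP library adds on its own.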
05:34
🔗
|
xk_id |
brayden: nice :) They appear capitalised. Besides 'host' :/ which I haven't configured... |
05:34
🔗
|
brayden |
nice |
05:35
🔗
|
xk_id |
I will add 'host' to my customised headers. I think by default the httpclient I'm using makes it lower case. |
05:36
🔗
|
xk_id |
and thanks. First time using netcat, actually (I know). |
05:37
🔗
|
brayden |
I've only ever used netcat in a project like once but fortunately its syntax is pretty simple |
05:37
🔗
|
brayden |
Since there was a bash script that, as part of its functionality, would listen for connections from a master to slaves |
05:38
🔗
|
xk_id |
very handy tool. |
05:40
🔗
|
xk_id |
Great, my client overwrites the 'host' header. I think I need to fiddle with the source. |
05:41
🔗
|
brayden |
if you have nmap installed you get ncat as well which has SSL! |
07:18
🔗
|
lemonkey |
stickam shutting down |
07:47
🔗
|
lemonkey |
http://blog.stickam.com/post/41909003713/stickamclosing |
08:00
🔗
|
db48x |
yep |
08:04
🔗
|
db48x |
we need to get started on it |
08:07
🔗
|
db48x |
hmm, no wiki page yet |
08:10
🔗
|
db48x |
wait, closing January 31st? |
08:11
🔗
|
SketchCow |
Morning. |
08:11
🔗
|
SketchCow |
It's 9:11am in East Berlin, now Berlin |
08:13
🔗
|
db48x |
oh, it began closing 12 minutes ago, and disappears February 28th |
08:13
🔗
|
Deewiant |
Presumably the January 31st bit indicates read-only mode |
08:13
🔗
|
db48x |
yea |
08:14
🔗
|
db48x |
for a second there I thought they had given a whole day's notice |
08:15
🔗
|
Deewiant |
It seems that all pages are replaced with the memorial note? |
08:15
🔗
|
adamcaudi |
Yeah, looks like they just took at all down |
08:15
🔗
|
adamcaudi |
*it |
08:16
🔗
|
db48x |
ouch |
08:16
🔗
|
db48x |
I could browse groups a few minutes ago |
08:16
🔗
|
db48x |
there was even a live stream in progress on the front page |
08:16
🔗
|
Deewiant |
Google cache has some stuff, with images still up at least |
08:17
🔗
|
db48x |
and hundreds of people in chat rooms |
08:18
🔗
|
Deewiant |
Their "random video from the Stickam archives" player doesn't seem to work at least for me: staging.stickam-player.stk doesn't resolve |
08:19
🔗
|
Deewiant |
Aha, https still works! |
08:19
🔗
|
Deewiant |
E.g. http://www.stickam.com/theoneringnet vs https://www.stickam.com/theoneringnet |
08:21
🔗
|
db48x |
ooh |
08:22
🔗
|
db48x |
no, only partially |
08:22
🔗
|
db48x |
groups are gone |
08:22
🔗
|
SketchCow |
Wow, they probably lost a lot of money really quickly. |
08:22
🔗
|
SketchCow |
Someone shut off the tap |
08:22
🔗
|
db48x |
who's Live is all empty |
08:22
🔗
|
db48x |
maybe they're pulling data from http though |
08:22
🔗
|
db48x |
SketchCow: yea |
08:26
🔗
|
SketchCow |
We might be screwed here, which is understandable. |
08:26
🔗
|
db48x |
the wording of the message was pretty misleading, too |
08:27
🔗
|
db48x |
it said that the site would remain alive until the 28th |
08:27
🔗
|
db48x |
well, I updated the wiki page |
08:27
🔗
|
db48x |
for whatever that's worth |
08:29
🔗
|
adamcaudi |
Looks like the https version of the "who's online" page still works - leads to working profiles, and working group pages |
08:29
🔗
|
db48x |
we might be able to spider the https site |
08:29
🔗
|
db48x |
adamcaudi: yea |
08:29
🔗
|
db48x |
I can't get any videos to load though |
08:31
🔗
|
db48x |
heh, clicking on the Randomizer button off to the side pops up an alert saying 'There is no live user.' |
08:38
🔗
|
SketchCow |
Wow |
08:43
🔗
|
ersi |
SketchCow: Hey timezone buddy. |
08:43
🔗
|
db48x |
I guess we have to go down the list of social networks in the wiki and just do them all now |
08:45
🔗
|
ersi |
"The site will remain alive here until February 28, 2013." from the StickAm post. |
08:45
🔗
|
db48x |
ersi: yea |
08:45
🔗
|
db48x |
and technically it still is there, and if you have an account you can log in and download your videos |
08:45
🔗
|
ersi |
Also, that was a fucking disastrous background on that blog post.. Barely readable. |
08:45
🔗
|
SketchCow |
Where's the visiting me |
08:46
🔗
|
ersi |
oops, I missed your line re 28th of february. Thought no one mentioned that |
08:46
🔗
|
SketchCow |
Or are you another one of the archive team members who makes $5 a week |
08:46
🔗
|
db48x |
so perhaps we could accidentally liberate a username/password list and just download everything ourselves |
08:48
🔗
|
db48x |
http://www.archiveteam.org/index.php?title=Stickam |
08:49
🔗
|
db48x |
that squarish dude with the sad eyes in their goodbye banner would make a good image for the page :P |
08:51
🔗
|
db48x |
hrm, http://player.stickam.com/flash/stickam/stickam_player.swf still exists, sorta |
08:52
🔗
|
db48x |
it's still a real swf file |
08:52
🔗
|
SketchCow |
I just shifted over the videos, godane. |
08:52
🔗
|
SketchCow |
So everything that's in g4video by mistake is where it should be |
09:02
🔗
|
xk_id |
Does anybody have any tips on figuring out if I'm getting a "hello world" page instead of the actual page I wish to crawl? (i.e. getting 'blacklisted' by the website) |
09:03
🔗
|
xk_id |
actually..... no, there can't be. |
09:03
🔗
|
xk_id |
even a human wouldn't be able to tell. |
09:03
🔗
|
ersi |
By finding how their "Fuck you page" looks and then knowing how it looks and looking for it :) |
09:04
🔗
|
xk_id |
:D |
09:04
🔗
|
adamcaudi |
xk_id, are you sure you're passing the host right? |
09:04
🔗
|
ersi |
Most likely, you'd get firewall'd off or 404'd/500'd or something |
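ersi's two suggestions (learn what the block page looks like, and watch for error statuses) can be combined into a simple check. A rough sketch under stated assumptions: the status codes beyond the 404/500 mentioned in the log, the marker strings, and all names are hypothetical and would need to be replaced with what the target site really returns:

```python
from dataclasses import dataclass

# Assumptions: 404 and 500 come from the discussion; 403 and 429 are guesses.
BLOCK_STATUS_CODES = {403, 404, 429, 500}
# Hypothetical phrases; collect real ones by getting blocked once on purpose.
BLOCK_MARKERS = ("access denied", "unusual traffic")


@dataclass
class Verdict:
    blocked: bool
    reason: str


def looks_blocked(status: int, body: str, expected_marker: str) -> Verdict:
    """Guess whether a response is a block page rather than the real page.

    expected_marker is text that every genuine page is known to contain.
    """
    lowered = body.lower()
    if status in BLOCK_STATUS_CODES:
        return Verdict(True, f"status {status}")
    for marker in BLOCK_MARKERS:
        if marker in lowered:
            return Verdict(True, f"block-page marker {marker!r}")
    if expected_marker.lower() not in lowered:
        return Verdict(True, "expected content missing")
    return Verdict(False, "looks like the real page")


print(looks_blocked(200, "<h1>Hello world</h1>", "profile-header"))
print(looks_blocked(200, "<div class='profile-header'>...</div>", "profile-header"))
```

The last check, requiring content that every real page has, is what catches a polite "hello world" substitute served with status 200.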
09:04
🔗
|
xk_id |
adamcaudi: what do you mean? the lower case header? |
09:05
🔗
|
adamcaudi |
xk_id, many servers return a default page if they can't find / understand what host you are asking for |
09:06
🔗
|
xk_id |
Sorry, I'm still not sure what you mean :) why would my host be illogical? |
09:08
🔗
|
xk_id |
as far as I can tell, my spider sends intelligible headers, and the RFC says they are not case sensitive |
09:10
🔗
|
adamcaudi |
I've seen case sensitive implementations - even though the RFC says it doesn't matter |
09:11
🔗
|
xk_id |
Unfortunately there's not much I can do at the moment. It has to do with the module I'm using. We've tried modifying the source code, but we're afraid of breaking something |
09:11
🔗
|
xk_id |
we're waiting for a developer to reply: https://github.com/mikeal/request/issues/426 |
09:12
🔗
|
xk_id |
thanks for the heads up... |
09:14
🔗
|
xk_id |
adamcaudi: do you have some reference I could add on github? |
09:14
🔗
|
xk_id |
perhaps it will press devs to respond |
09:15
🔗
|
adamcaudi |
I'll let you know if I can remember which server it was - been some time, can't think of which one it is right now |
09:16
🔗
|
xk_id |
Ok |
09:17
🔗
|
adamcaudi |
Do you have the actual request that was sent? Curious to see if there's something else odd with it |
09:22
🔗
|
xk_id |
adamcaudi: par example http://dpaste.com/903092/ |
09:22
🔗
|
xk_id |
as captured by netcat |
09:26
🔗
|
xk_id |
They're artificially created by me, btw |
09:26
🔗
|
xk_id |
well, at the upper-cased ones at least :P |
09:26
🔗
|
xk_id |
without *at |
09:36
🔗
|
adamcaudi |
Only thing that jumps out at me is the order is odd - host is normally the first header (so second line), but that shouldn't change anything |
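If the HTTP library insists on rewriting `host` or reordering headers, one way to see what an "ideal" request would look like is to serialise it by hand. A minimal sketch, assuming a list of (name, value) pairs; `build_request` is a made-up helper, `example.org` is a placeholder, and the header values are taken from fragments quoted earlier in the log:

```python
def build_request(method, path, headers):
    """Serialise an HTTP/1.1 request, keeping header names and order as given."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # A blank line (CRLF CRLF) terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("iso-8859-1")


# Host first, as adamcaudi describes; capitalised names, as seen in the captures.
headers = [
    ("Host", "example.org"),
    ("User-Agent", "Mozilla/5.0"),
    ("Accept", "*/*"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Connection", "keep-alive"),
]

request_bytes = build_request("GET", "/", headers)
print(request_bytes.decode("iso-8859-1"))
```

Comparing these bytes with what netcat captures from the crawler shows exactly where the library deviates.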
09:39
🔗
|
xk_id |
hmm... |
10:26
🔗
|
Nemo_bis |
http://xkcd.com/1168/ |
10:27
🔗
|
ersi |
#archiveteam-bs man |
10:27
🔗
|
Nemo_bis |
I don't think so |
12:44
🔗
|
godane |
this sucks |
12:45
🔗
|
godane |
i downloaded nerds 2.0 pbs series |
12:45
🔗
|
godane |
looks like the video is only under 600kbps when the file is 891mb |
12:45
🔗
|
godane |
this is because the audio is pcm and has a bitrate of 1411kbps |
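The 1411 figure is what uncompressed CD-quality PCM works out to, which makes the numbers easy to sanity-check. A back-of-the-envelope sketch, assuming 44.1 kHz, 16-bit stereo audio, a video rate at the 600 upper bound, and 1 MB = 10^6 bytes; container overhead is ignored, so the runtime is only a rough estimate:

```python
# Assumed audio format: CD-quality PCM.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

pcm_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000  # 1411.2

video_kbps = 600  # upper bound quoted in the log ("under 600")
file_mb = 891     # file size quoted in the log

total_kbps = pcm_kbps + video_kbps
audio_share = pcm_kbps / total_kbps               # fraction of the file that is audio
est_minutes = file_mb * 8 * 1000 / total_kbps / 60  # MB -> kilobits -> seconds -> minutes

print(f"PCM audio: {pcm_kbps:.1f} kbps")
print(f"Audio share of the stream: {audio_share:.0%}")
print(f"Rough runtime: {est_minutes:.0f} minutes")
```

Under these assumptions the audio takes roughly 70% of the file, which is consistent with godane's point that the size is misleading about the video quality.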
14:35
🔗
|
turnkit |
someone fell asleep encoding that one :( |
14:45
🔗
|
godane |
its still very watchable |
14:45
🔗
|
godane |
and when it's derived there will be a smaller one |
14:49
🔗
|
godane |
i'm uploading a blockbuster customer service tape from 2000 |
14:50
🔗
|
godane |
i got another one also that i will upload called the different guest |
22:35
🔗
|
S[h]O[r]T |
is there any public effort into archiving pastebin type sites? |
22:35
🔗
|
S[h]O[r]T |
*ongoing |
22:44
🔗
|
ersi |
Not that I'm aware of |