| Time |
Nickname |
Message |
|
01:48
|
Brenry |
on those geocities archive sites.. did they ever scrape user data ? or just those neighborhood things |
|
04:27
|
xk_id |
Is 'accept-encoding': 'gzip, deflate' a non-suspicious header to use for my crawler? |
|
04:27
|
xk_id |
e.g http://pastebin.com/Fdxxs7We |
|
04:37
|
brayden |
I think it is a bit weird you're using a pretty old version of Firefox on that header but gzip, deflate seems to be standard. |
|
04:37
|
brayden |
that's what I'm seeing on my connection to pastebin |
|
04:37
|
brayden |
Do you actually capitalise the headers though? |
|
04:37
|
brayden |
As they appear to be capitalised normally |
|
04:38
|
brayden |
i.e. User-Agent as opposed to user-agent |
|
04:40
|
xk_id |
hmm.. |
|
04:41
|
xk_id |
My Safari seems not to capitalise them |
|
04:41
|
xk_id |
but Chrome does. |
|
04:41
|
xk_id |
Strangely, however, Chrome sends a bit more headers as well.... |
|
04:44
|
xk_id |
wait, no, what am I saying, Chrome doesn't capitalise either. |
|
04:45
|
* |
brayden opens wireshark |
|
04:45
|
brayden |
Firefox does |
|
04:46
|
xk_id |
It's strange that I cannot find a list with full real headers on google. |
|
04:46
|
xk_id |
I guess I'll have to catch them myself locally from the browsers I have. |
|
04:46
|
brayden |
http://brayden.ur.cx/images/2013-02-01_12-46-27.png is part of it. |
|
04:47
|
brayden |
Chrome seems to capitalise too |
|
04:47
|
xk_id |
How are you catching them? |
|
04:47
|
brayden |
Wireshark with filter on HTTP |
|
04:48
|
brayden |
I also have a plugin on Mozilla that shows me request headers and responses. |
|
04:49
|
xk_id |
I've created a small server and am printing the requests... it looks like this for Chrome: http://i.imgur.com/AsLGAZW.png |
|
04:51
|
brayden |
Just did a packet capture on the server with tcpdump and it is showing what wireshark showed. |
|
04:51
|
brayden |
0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac |
|
04:51
|
brayden |
albeit a bit squished |
|
04:51
|
brayden |
0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:.*/*..User- |
|
04:51
|
brayden |
0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5 |
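The three tcpdump rows pasted above can be decoded back into the bytes that were on the wire. A minimal sketch, assuming each row carries a full 16 bytes (eight hex groups) after the offset; `decode_tcpdump_hex` is a hypothetical helper written for this log, not part of tcpdump or Wireshark:

```python
def decode_tcpdump_hex(rows):
    """Rebuild raw bytes from tcpdump-style hex rows (assumes full 16-byte rows)."""
    raw = bytearray()
    for row in rows:
        # Drop the "0x0060:" offset, keep the eight hex groups, and ignore
        # the trailing ASCII preview column.
        _offset, _, rest = row.partition(":")
        groups = rest.split()[:8]
        raw += bytes.fromhex("".join(groups))
    return bytes(raw)


rows = [
    "0x0060: 3a20 6b65 6570 2d61 6c69 7665 0d0a 4163 :.keep-alive..Ac",
    "0x0070: 6365 7074 3a20 2a2f 2a0d 0a55 7365 722d cept:.*/*..User-",
    "0x0080: 4167 656e 743a 204d 6f7a 696c 6c61 2f35 Agent:.Mozilla/5",
]
decoded = decode_tcpdump_hex(rows)
print(decoded.decode("latin-1"))
```

The decoded fragment contains `Accept` and `User-Agent` with their capitals intact, which matches what the packet capture shows.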
|
04:57
|
xk_id |
I'm also getting capitals with Wireshark... |
|
04:57
|
xk_id |
what the hell... |
|
04:57
|
brayden |
well there you go |
|
04:57
|
brayden |
your web server is weird :P |
|
04:57
|
xk_id |
I suppose my server was doing some parsing? |
|
04:57
|
xk_id |
strange, but okay. |
|
04:57
|
brayden |
Looks like it gave some JSON-like output? |
|
04:58
|
xk_id |
oh |
|
04:58
|
xk_id |
yes, that's correct |
|
04:58
|
xk_id |
Now, I still have the problem of finding a bunch of genuine headers. |
|
05:05
|
xk_id |
Well I suppose I could just capitalise them. |
|
05:22
|
xk_id |
brayden: can I direct my crawler to your server to test its headers, please? |
|
05:22
|
brayden |
I don't have a script to return headers |
|
05:22
|
xk_id |
ah, okay. |
|
05:23
|
xk_id |
I'm a bit concerned because when I define the headers, I define them in JSON. So I'm not sure what Node.js is doing with the objects afterwards. |
|
05:23
|
xk_id |
fingers crossed. |
|
05:24
|
brayden |
oh |
|
05:24
|
brayden |
Do nc -lk 80 |
|
05:24
|
brayden |
where 80 is the port |
|
05:24
|
brayden |
-k keeps it listening after the connection has been closed, i.e. keeps the script open |
|
05:24
|
brayden |
It should send headers |
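For anyone without netcat to hand, the same check can be sketched in Python: listen on a port and print each request exactly as it arrives, with no header parsing in between. This is only an illustration; `read_request_head`, `listen_and_print`, and port 8080 are assumptions made for the sketch, not anything used in the channel:

```python
import socket


def read_request_head(conn, max_bytes=65536):
    """Read from a connected socket until the blank line that ends the headers."""
    data = b""
    while b"\r\n\r\n" not in data and len(data) < max_bytes:
        chunk = conn.recv(4096)
        if not chunk:  # peer closed the connection
            break
        data += chunk
    return data


def listen_and_print(port=8080, host="127.0.0.1"):
    """Keep listening (like nc -lk) and print each request as received."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen()
        while True:
            conn, _addr = listener.accept()
            with conn:
                # latin-1 maps every byte to one character, so header
                # casing and ordering are shown exactly as sent.
                print(read_request_head(conn).decode("latin-1"))


# To try it: call listen_and_print(8080) and point the crawler at
# http://127.0.0.1:8080/ (the port is an arbitrary choice).
```

Because nothing parses the request, any lower-casing seen here comes from the client, not from the listener.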
|
05:34
|
xk_id |
brayden: nice :) They appear capitalised. Besides 'host' :/ which I haven't configured... |
|
05:34
|
brayden |
nice |
|
05:35
|
xk_id |
I will add 'host' to my customised headers. I think by default the httpclient I'm using makes it lower case. |
|
05:36
|
xk_id |
and thanks. First time using netcat, actually (I know). |
|
05:37
|
brayden |
I've only ever used netcat in a project like once but fortunately its syntax is pretty simple |
|
05:37
|
brayden |
Since there was a bash script that, as part of its functionality, would listen to connections from a master to slaves |
|
05:38
|
xk_id |
very handy tool. |
|
05:40
|
xk_id |
Great, my client overwrites the 'host' header. I think I need to fiddle with the source. |
|
05:41
|
brayden |
if you have nmap installed you get ncat as well which has SSL! |
|
07:18
|
lemonkey |
stickam shutting down |
|
07:47
|
lemonkey |
http://blog.stickam.com/post/41909003713/stickamclosing |
|
08:00
|
db48x |
yep |
|
08:04
|
db48x |
we need to get started on it |
|
08:07
|
db48x |
hmm, no wiki page yet |
|
08:10
|
db48x |
wait, closing January 31st? |
|
08:11
|
SketchCow |
Morning. |
|
08:11
|
SketchCow |
It's 9:11am in East Berlin, now Berlin |
|
08:13
|
db48x |
oh, it began closing 12 minutes ago, and disappears February 28th |
|
08:13
|
Deewiant |
Presumably the January 31st bit indicates read-only mode |
|
08:13
|
db48x |
yea |
|
08:14
|
db48x |
for a second there I thought they had given a whole day's notice |
|
08:15
|
Deewiant |
It seems that all pages are replaced with the memorial note? |
|
08:15
|
adamcaudi |
Yeah, looks like they just took at all down |
|
08:15
|
adamcaudi |
*it |
|
08:16
|
db48x |
ouch |
|
08:16
|
db48x |
I could browse groups a few minutes ago |
|
08:16
|
db48x |
there was even a live stream in progress on the front page |
|
08:16
|
Deewiant |
Google cache has some stuff, with images still up at least |
|
08:17
|
db48x |
and hundreds of people in chat rooms |
|
08:18
|
Deewiant |
Their "random video from the Stickam archives" player doesn't seem to work at least for me: staging.stickam-player.stk doesn't resolve |
|
08:19
|
Deewiant |
Aha, https still works! |
|
08:19
|
Deewiant |
E.g. http://www.stickam.com/theoneringnet vs https://www.stickam.com/theoneringnet |
|
08:21
|
db48x |
ooh |
|
08:22
|
db48x |
no, only partially |
|
08:22
|
db48x |
groups are gone |
|
08:22
|
SketchCow |
Wow, they probably lost a lot of money really quickly. |
|
08:22
|
SketchCow |
Someone shut off the tap |
|
08:22
|
db48x |
who's Live is all empty |
|
08:22
|
db48x |
maybe they're pulling data from http though |
|
08:22
|
db48x |
SketchCow: yea |
|
08:26
|
SketchCow |
We might be screwed here, which is understandable. |
|
08:26
|
db48x |
the wording of the message was pretty misleading, too |
|
08:27
|
db48x |
it said that the site would remain alive until the 28th |
|
08:27
|
db48x |
well, I updated the wiki page |
|
08:27
|
db48x |
for whatever that's worth |
|
08:29
|
adamcaudi |
Looks like the https version of the "who's online" page still works - leads to working profiles, and working group pages |
|
08:29
|
db48x |
we might be able to spider the https site |
|
08:29
|
db48x |
adamcaudi: yea |
|
08:29
|
db48x |
I can't get any videos to load though |
|
08:31
|
db48x |
heh, clicking on the Randomizer button off to the side pops up an alert saying 'There is no live user.' |
|
08:38
|
SketchCow |
Wow |
|
08:43
|
ersi |
SketchCow: Hey timezone buddy. |
|
08:43
|
db48x |
I guess we have to go down the list of social networks in the wiki and just do them all now |
|
08:45
|
ersi |
"The site will remain alive here until February 28, 2013." from the StickAm post. |
|
08:45
|
db48x |
ersi: yea |
|
08:45
|
db48x |
and technically it still is there, and if you have an account you can log in and download your videos |
|
08:45
|
ersi |
Also, that was a fucking disastrous background on that blog post.. Barely readable. |
|
08:45
|
SketchCow |
Where's the visiting me |
|
08:46
|
ersi |
oops, I missed your line re 28th of february. Thought no one mentioned that |
|
08:46
|
SketchCow |
Or are you another one of the archive team members who makes $5 a week |
|
08:46
|
db48x |
so perhaps we could accidentally liberate a username/password list and just download everything ourselves |
|
08:48
|
db48x |
http://www.archiveteam.org/index.php?title=Stickam |
|
08:49
|
db48x |
that squarish dude with the sad eyes in their goodbye banner would make a good image for the page :P |
|
08:51
|
db48x |
hrm, http://player.stickam.com/flash/stickam/stickam_player.swf still exists, sorta |
|
08:52
|
db48x |
it's still a real swf file |
|
08:52
|
SketchCow |
I just shifted over the videos, godane. |
|
08:52
|
SketchCow |
So everything that's in g4video by mistake is where it should be |
|
09:02
|
xk_id |
Anybody has any tips on figuring out if I'm getting a "hello world" page instead of the actual page I wish to crawl? (i.e getting 'blacklisted' by the website) |
|
09:03
|
xk_id |
actually..... no, there can't be. |
|
09:03
|
xk_id |
even a human wouldn't be able to tell. |
|
09:03
|
ersi |
By finding how their "Fuck you page" looks and then knowing how it looks and looking for it :) |
|
09:04
|
xk_id |
:D |
|
09:04
|
adamcaudi |
xk_id, are you sure you're past the host right? |
|
09:04
|
ersi |
Most likely, you'd get firewall'd off or 404'd/500'd or something |
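ersi's two suggestions (learn what the block page looks like, and watch for error statuses) can be combined into one check. A rough sketch only: the `Page` class, the marker phrases, and the `expected_text` idea are all assumptions made for illustration, and the real phrases would have to be copied from the target site's own block page:

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical minimal view of a fetched response."""
    status: int
    body: str


# Placeholder phrases; replace with text taken from the site's real block page.
BLOCK_MARKERS = ("access denied", "too many requests")


def looks_blocked(page, expected_text=None, markers=BLOCK_MARKERS):
    """Return True when a response looks like a block page rather than content."""
    if page.status >= 400:  # 404'd / 500'd, as mentioned above
        return True
    body = page.body.lower()
    if any(marker in body for marker in markers):
        return True
    # Text that every genuine page contains (a footer, say) is a positive
    # check: if it is missing, a substitute page was probably served.
    if expected_text is not None and expected_text.lower() not in body:
        return True
    return False
```

The `expected_text` check is the part aimed at the "hello world" case: a substitute page served with status 200 and no obvious block wording still lacks whatever the real pages always contain.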
|
09:04
|
xk_id |
adamcaudi: what do you mean? the lower case header? |
|
09:05
|
adamcaudi |
xk_id, many servers return a default page if it can't find / understand what host you are asking for |
|
09:06
|
xk_id |
Sorry, I'm still not sure what you mean :) why would my host be illogical? |
|
09:08
|
xk_id |
as far as I can tell, my spider sends intelligible headers, and the RFC says they are not case sensitive |
|
09:10
|
adamcaudi |
I've seen case sensitive implementations - even though the RFC says it doesn't matter |
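The point both sides are making can be shown in a few lines: a compliant receiver matches header names without regard to case, while the original casing is still visible in the raw request (which is what a case-sensitive implementation would trip over). A minimal sketch with hypothetical helper names and a made-up request:

```python
def parse_headers(raw):
    """Return (name, value) pairs in wire order, keeping the original casing."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    pairs = []
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(":")
        pairs.append((name, value.strip()))
    return pairs


def get_header(pairs, name):
    """Case-insensitive lookup, as the HTTP spec requires of receivers."""
    wanted = name.lower()
    for header_name, value in pairs:
        if header_name.lower() == wanted:
            return value
    return None


# Made-up request mixing a lower-case 'host' with a capitalised User-Agent.
raw = b"GET / HTTP/1.1\r\nhost: example.org\r\nUser-Agent: demo\r\n\r\n"
pairs = parse_headers(raw)
print(get_header(pairs, "Host"))        # found despite the lower-case name
print([name for name, _ in pairs])      # original casing and order
```

Keeping the pairs in a list rather than a dict also preserves header order, which matters if the goal is to compare a crawler's request against a browser's.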
|
09:11
|
xk_id |
Unfortunately there's not much I can do at the moment. It has to do with the module I'm using. We've tried modifying the source code, but we're afraid of breaking something |
|
09:11
|
xk_id |
we're waiting for a developer to reply: https://github.com/mikeal/request/issues/426 |
|
09:12
|
xk_id |
thanks for the heads up... |
|
09:14
|
xk_id |
adamcaudi: do you have some reference I could add on github? |
|
09:14
|
xk_id |
perhaps it will press devs to respond |
|
09:15
|
adamcaudi |
I'll let you know if I can remember which server it was - been some time, can't think of which one it is right now |
|
09:16
|
xk_id |
Ok |
|
09:17
|
adamcaudi |
Do you have the actual request that was sent? Curious to see if there's something else odd with it |
|
09:22
|
xk_id |
adamcaudi: par example http://dpaste.com/903092/ |
|
09:22
|
xk_id |
as captured by netcat |
|
09:26
|
xk_id |
They're artificially created by me, btw |
|
09:26
|
xk_id |
well, at the upper-cased ones at least :P |
|
09:26
|
xk_id |
without *at |
|
09:36
|
adamcaudi |
Only thing that jumps out at me is the order is odd - host is normally the first header (so second line), but that shouldn't change anything |
|
09:39
|
xk_id |
hmm... |
|
10:26
|
Nemo_bis |
http://xkcd.com/1168/ |
|
10:27
|
ersi |
#archiveteam-bs man |
|
10:27
|
Nemo_bis |
I don't think so |
|
12:44
|
godane |
this sucks |
|
12:45
|
godane |
i downloaded nerds 2.0 pbs series |
|
12:45
|
godane |
looks like the video is only under 600kbps when the file is 891mb |
|
12:45
|
godane |
this is cause the audio is pcm and has a bitrate of 1411kbps |
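The 1411 figure is what uncompressed CD-quality PCM works out to, which is why the audio can outweigh the video here. A quick check, assuming the track is 44.1 kHz, 16-bit stereo (the log does not state the format) and taking the quoted 600 as an upper bound on the video bitrate:

```python
# Assumed PCM parameters (CD quality); not stated in the log itself.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

# Uncompressed PCM bitrate = sample rate x bit depth x channels.
pcm_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000

video_kbps = 600  # upper bound quoted above
audio_share = pcm_kbps / (pcm_kbps + video_kbps)

print(f"PCM audio: {pcm_kbps:.1f} kbps")
print(f"audio share of the stream: {audio_share:.0%}")
```

Under those assumptions the audio accounts for roughly 70% of the stream, so most of the 891 MB is sound rather than picture.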
|
14:35
|
turnkit |
someone fell asleep encoding that one :( |
|
14:45
|
godane |
its still very watchable |
|
14:45
|
godane |
and when its derived there will be a smaller one |
|
14:49
|
godane |
i'm uploading a blockbuster customer service tape from 2000 |
|
14:50
|
godane |
i got another one also that i will upload called the different guest |
|
22:35
|
S[h]O[r]T |
is there any public effort into archiving pastebin type sites? |
|
22:35
|
S[h]O[r]T |
*ongoing |
|
22:44
|
ersi |
Not that I'm aware of |