Time |
Nickname |
Message |
01:16
🔗
|
godane |
SketchCow: I'm backing up defcon.org |
01:17
🔗
|
godane |
it's only 3.2gb |
01:17
🔗
|
godane |
i'm not backing up the subdomains of defcon.org like media.defcon.org |
04:12
🔗
|
yipdw |
heh |
04:12
🔗
|
yipdw |
7008 yipdw 20 0 380m 81m 11m S 390 0.7 10:37.05 rbx |
04:12
🔗
|
yipdw |
that's what I like to see, 390% CPU utilization from Ruby |
04:12
🔗
|
yipdw |
almost all cores in use :D |
04:13
🔗
|
chronomex |
awwwwyeah |
04:32
🔗
|
tef |
yipdw: what are you using to archive it ? |
04:33
🔗
|
yipdw |
tef: archive what? |
04:33
🔗
|
tef |
ah shit I misread scrollback |
04:33
🔗
|
tef |
was assuming that was what ruby was up to |
04:33
🔗
|
tef |
christ how did I get ops? |
04:33
🔗
|
yipdw |
oh |
04:33
🔗
|
yipdw |
heh |
04:34
🔗
|
yipdw |
no, it's crawling a graph |
04:35
🔗
|
Coderjoe |
149,566,088 bytes before, 215,611,184 after |
04:35
🔗
|
Coderjoe |
you've nearly doubled the redis memory usage |
04:35
🔗
|
yipdw |
I stopped writing to it |
04:36
🔗
|
Coderjoe |
oh |
04:36
🔗
|
yipdw |
I'm using my home machine right now because it can give me better performance |
04:36
🔗
|
yipdw |
at some point I'll resync and move the crawler back |
04:37
🔗
|
yipdw |
and by "better" I mean that I can get on the order of 1k work items/sec processed |
04:37
🔗
|
yipdw |
the m1.small EC2 instance was doing like 100 |
04:46
🔗
|
Coderjoe |
mmm |
04:47
🔗
|
Coderjoe |
someone just tried probing my ec2 instance's ssh port |
04:51
🔗
|
kennethre |
Ah I've missed you all |
04:51
🔗
|
tef |
hullo |
04:51
🔗
|
chronomex |
hrm? |
04:53
🔗
|
kennethre |
tef: so do you think any changes would need to be made to requests? |
04:53
🔗
|
tef |
well the only thing I can think of is to pass in something to the session, or making a session wrapper in warctools |
04:54
🔗
|
tef |
personally i'd dump a callback or a method in session to capture raw data, and either pass it in as an argument or subclass it |
04:55
🔗
|
tef |
so then it's just a def hook(request, response): write_warc(request.raw, response.raw) |
04:55
🔗
|
tef |
I don't think this is a common use case |
04:55
🔗
|
tef |
so i'd expect people to have a request dependency from using warctools.requests (say), rather than the other way around |
04:56
🔗
|
tef |
on the other hand, having a generic hook would be nice for logging/debugging |
04:56
🔗
|
tef |
a trace handler, really |
04:56
🔗
|
tef |
I don't know the requests api well enough to make a concrete suggestion |
04:57
🔗
|
kennethre |
tef: that sounds overly complciated: ) |
04:57
🔗
|
kennethre |
*complicated |
04:57
🔗
|
kennethre |
keep it simple |
04:57
🔗
|
tef |
ok |
04:58
🔗
|
tef |
add a new event hook? that is fired with the request, response pair ? |
04:58
🔗
|
kennethre |
i don't understand why that's necessary |
04:58
🔗
|
kennethre |
a warc records a single request right? or a bunch in one file? |
04:59
🔗
|
tef |
well the thing is to do it we have to pair off the request/response records in the warc |
04:59
🔗
|
tef |
with the WARC-Concurrent-To: header |
04:59
🔗
|
tef |
and *technically* you need to know the request to parse the response |
04:59
🔗
|
tef |
(because of HEAD requests) |
04:59
🔗
|
tef |
but you can ignore it for nearly everything because they're rare in practice |
05:00
🔗
|
kennethre |
what does WARC-Concurrent-To do? |
05:00
🔗
|
tef |
but we need to know which request goes with which response |
05:00
🔗
|
kennethre |
i still don't see how this is an issue :) |
05:00
🔗
|
tef |
http://secretvolcanobase.org/~tef/boingboing_sopa.warc |
05:00
🔗
|
tef |
if you look at the warc record |
05:01
🔗
|
tef |
I write the id of the request record in the response concurrent-to header |
05:01
🔗
|
tef |
and vice versa |
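A minimal sketch of the pairing tef describes, where each record of a request/response pair carries the other's ID in `WARC-Concurrent-To`. The function names `new_record_id` and `paired_headers` are hypothetical, and the `<urn:uuid:...>` ID form is an assumption based on common WARC output, not something stated in the chat.

```python
import uuid


def new_record_id():
    """Return a record ID in the <urn:uuid:...> form (assumed convention)."""
    return "<urn:uuid:{}>".format(uuid.uuid4())


def paired_headers(target_uri):
    """Build header dicts for a request/response pair that point at each other."""
    request_id = new_record_id()
    response_id = new_record_id()
    request_headers = {
        "WARC-Type": "request",
        "WARC-Record-ID": request_id,
        "WARC-Target-URI": target_uri,
        # The request names the response it belongs with...
        "WARC-Concurrent-To": response_id,
    }
    response_headers = {
        "WARC-Type": "response",
        "WARC-Record-ID": response_id,
        "WARC-Target-URI": target_uri,
        # ...and the response names the request, so records can be interleaved.
        "WARC-Concurrent-To": request_id,
    }
    return request_headers, response_headers
```

Because the link is carried in the headers, the two records do not need to sit back to back in the file.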
05:01
🔗
|
kennethre |
yes, i see a request and response |
05:01
🔗
|
tef |
so with the current event hooks, how would I match them up ? |
05:02
🔗
|
kennethre |
i still don't see how this is special |
05:02
🔗
|
kennethre |
i'm just not seeing something obvious or something :) |
05:05
🔗
|
kennethre |
you make a request, you get a response |
05:05
🔗
|
kennethre |
there's no disconnect |
05:05
🔗
|
Coderjoe |
without it, the request and response to that request would have to be back to back. with it, you can interleave |
05:05
🔗
|
Coderjoe |
if more than one request were to be made at the same time (or before the previous finished) |
05:06
🔗
|
kennethre |
you mean http/1.1 pipelining or app-level connection pooling? |
05:08
🔗
|
Coderjoe |
for example, a web browser is generally permitted to open two tcp connections to the server at the same time |
05:08
🔗
|
Coderjoe |
and issue requests concurrently on them |
05:08
🔗
|
kennethre |
yeah |
05:08
🔗
|
kennethre |
you could open 5000 with requests |
05:09
🔗
|
kennethre |
you don't not have a reference to the request though |
05:09
🔗
|
kennethre |
it's a non-issue |
05:09
🔗
|
kennethre |
responses have a request attribute |
05:09
🔗
|
tef |
argh wireless |
05:09
🔗
|
kennethre |
even if it didn't though, it wouldn't be an issue |
05:12
🔗
|
tef |
kennethre: no I mean, if you passed in two event hooks, one gets the request, one gets the response |
05:12
🔗
|
tef |
if you had multiple things or threads calling on the same session how do I know they match up |
05:12
🔗
|
tef |
ah shit |
05:12
🔗
|
tef |
if you had multiple things or threads calling on the same session how do I know they match up |
05:13
🔗
|
tef |
kennethre: oh doh |
05:13
🔗
|
tef |
I see |
05:13
🔗
|
kennethre |
this is super simple |
05:13
🔗
|
kennethre |
sessions are thread safe |
05:13
🔗
|
tef |
05:09 < kennethre> responses have a request attribute |
05:13
🔗
|
kennethre |
even if they didn't |
05:13
🔗
|
kennethre |
it wouldn't be an issue :) |
05:13
🔗
|
tef |
yes |
05:14
🔗
|
tef |
that was the bit I was missing |
05:14
🔗
|
tef |
but hey, it's 5am :-) |
05:14
🔗
|
kennethre |
haha |
05:14
🔗
|
tef |
although I did wake up at midday.... |
05:14
🔗
|
kennethre |
how's this |
05:14
🔗
|
kennethre |
i feel like i could write this thing in 2 hours |
05:14
🔗
|
tef |
so yeah I just write a nice hook |
05:14
🔗
|
kennethre |
no hook |
05:14
🔗
|
kennethre |
wrap |
05:14
🔗
|
kennethre |
out of band :) |
05:14
🔗
|
kennethre |
no need to mess with internals |
05:15
🔗
|
tef |
well I was figuring a wrapper that passed in a hook :v but I guess wrapping it also works |
05:15
🔗
|
tef |
and is probably simpler |
05:15
🔗
|
tef |
I think I got labeled the architect because I over engineer things, or that I'm terrible at testing. It's a job title of shame. |
05:15
🔗
|
kennethre |
you make a request, record what you need, get the response, record what you need |
05:16
🔗
|
kennethre |
put them in a giant pool if you need, out comes a warc |
05:16
🔗
|
kennethre |
i can even add this into requests itself |
05:16
🔗
|
kennethre |
i've thought about it |
05:16
🔗
|
kennethre |
response.warc |
05:16
🔗
|
kennethre |
tef: hehe, I *hate* overengineering :) |
05:16
🔗
|
tef |
kennethre: doesn't everyone :-) |
05:16
🔗
|
kennethre |
it's my mission in life to abolish it |
05:16
🔗
|
kennethre |
esp in python :) |
05:17
🔗
|
tef |
well it's my mission in life to avoid it, it's effort |
05:17
🔗
|
kennethre |
so a warc file could have 300 requests in it, right? |
05:17
🔗
|
tef |
well 300 request/responses |
05:17
🔗
|
kennethre |
(obviously) |
05:18
🔗
|
tef |
technically a warc file starts with a warcinfo record with an anvl separated (read key:value\r\n iirc) format |
05:18
🔗
|
kennethre |
and if a server responds with a transfer encoding |
05:18
🔗
|
kennethre |
like gzip, does it ungzip it first? |
05:18
🔗
|
tef |
well, there are *two things* you can do |
05:18
🔗
|
tef |
you can just ungzip it and unchunk it (we do that because other software is shitty at parsing) |
05:18
🔗
|
tef |
or you can write a conversion record that is concurrent to the request/response record |
05:18
🔗
|
tef |
which can contain the original or the converted one . |
05:19
🔗
|
tef |
i'd probably have to doublecheck on that restriction but |
05:19
🔗
|
kennethre |
i'll have to look at the spec |
05:19
🔗
|
tef |
I think archive.org hates gzip encoding |
05:19
🔗
|
kennethre |
but i feel like i could do this pretty easily |
05:19
🔗
|
tef |
kennethre: it's an iso standard, but 1.00 is basically 0.17 |
05:19
🔗
|
kennethre |
even add some nice gevent magic |
05:19
🔗
|
tef |
yeah the thing is I realized it was easy but I figured i'd ask |
05:20
🔗
|
kennethre |
i guess it needs to crawl too, though? |
05:20
🔗
|
tef |
well I have some code to do basic crawling somewhere |
05:20
🔗
|
tef |
I wrote it in an afternoon as a code sample so it's terrible |
05:20
🔗
|
kennethre |
tef: i'd love to collaborate on this if you're in the mood for such a thing ;) |
05:20
🔗
|
tef |
sure |
05:20
🔗
|
kennethre |
tef: I've been meaning to get more involved with archiveteam |
05:21
🔗
|
tef |
I was gonna just hack this shit up in warctools and push it and hope someone here finds it useful to augment the warc-get bulldozers |
05:21
🔗
|
kennethre |
I'm in :) |
05:21
🔗
|
tef |
thing is, writing warcs is *relatively* straightforward |
05:21
🔗
|
kennethre |
yeah |
05:21
🔗
|
tef |
you could probably do it yourself pretty easily |
05:21
🔗
|
kennethre |
yeah :) |
05:22
🔗
|
tef |
it's basically VERSION CR LF HEADERS* CRLF BODY CRLF |
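The layout above can be sketched as a small serializer. This is an illustration only, not the warctools implementation: `serialize_record` is a hypothetical name, and it assumes the record closes with two CRLFs, which is how the WARC 1.0 grammar is usually read and is one more than the shorthand in the message.

```python
def serialize_record(headers, body, version=b"WARC/1.0"):
    """Serialize one record: version line, headers, blank line, body, trailer.

    headers is a list of (name, value) string pairs; body is bytes.
    """
    lines = [version]
    for name, value in headers:
        lines.append("{}: {}".format(name, value).encode("utf-8"))
    # Content-Length counts only the bytes of the body.
    lines.append("Content-Length: {}".format(len(body)).encode("ascii"))
    head = b"\r\n".join(lines) + b"\r\n\r\n"
    # Assumption: two CRLFs terminate the record.
    return head + body + b"\r\n\r\n"
```

A real record would also need headers such as `WARC-Type`, `WARC-Record-ID` and `WARC-Date` to be standards compliant; they are passed in by the caller here rather than generated.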
05:22
🔗
|
kennethre |
if there's no aversion to dependencies, we could make it crazy awesome w/ gevent |
05:22
🔗
|
tef |
I have to do error correcting parsing on a bunch of weird formats |
05:22
🔗
|
kennethre |
i'll go to 11 :) |
05:22
🔗
|
tef |
well I have no problem with them |
05:23
🔗
|
tef |
but http://code.hanzoarchives.com/warc-tools/src/58d7d99406b0/hanzo/warctools/warc.py#cl-51 (btw now MIT) is how we write them |
05:23
🔗
|
tef |
the boingboing warc is pretty much all you need beyond a warcinfo record |
05:24
🔗
|
tef |
if you push something to git I can flesh out the warcwriting to make it standards compliant |
05:24
🔗
|
tef |
they also like to be gzipped, record by record and then catted together |
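A short sketch of "gzipped record by record, then catted together", using only the standard library. The record contents are placeholders; `gzip_records` is a hypothetical helper name.

```python
import gzip


def gzip_records(records):
    """Compress each record as its own gzip member and concatenate the members."""
    return b"".join(gzip.compress(record) for record in records)


records = [b"first record\r\n\r\n", b"second record\r\n\r\n"]
blob = gzip_records(records)

# A multi-member gzip stream decompresses to the concatenation of its members.
restored = gzip.decompress(blob)
```

Compressing per record means a reader can seek to the start of any member and decompress just that record, at the cost of a slightly worse compression ratio than gzipping the whole file at once.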
05:25
🔗
|
tef |
my github is tef fwiw |
05:26
🔗
|
tef |
cos I am totally in a mood to code before bedtime, and we're at the edge of a release cycle so I have to not break work code today |
05:26
🔗
|
kennethre |
awesome |
05:26
🔗
|
kennethre |
I'm a member of the archiveteam org |
05:26
🔗
|
kennethre |
might push something up there |
05:26
🔗
|
kennethre |
we'll see |
05:26
🔗
|
kennethre |
depends on my mood |
05:26
🔗
|
kennethre |
i have 800 projects going on right now |
05:26
🔗
|
tef |
heh |
05:26
🔗
|
kennethre |
this one sounds nice and quick though, so i might kick it out ;) |
05:27
🔗
|
tef |
well I am interested in making requests + warc output happen |
05:27
🔗
|
kennethre |
are warcs binary? |
05:27
🔗
|
kennethre |
utf8? any encoding |
05:27
🔗
|
tef |
well, technically the body is in binary |
05:27
🔗
|
tef |
the headers *can* be utf-8 |
05:27
🔗
|
kennethre |
headers are supposed to be latin1 |
05:27
🔗
|
tef |
but most of the values tend to be ascii anyway |
05:27
🔗
|
kennethre |
grr |
05:27
🔗
|
tef |
the headers of a warc file can be utf-8 |
05:27
🔗
|
kennethre |
but the file itself |
05:28
🔗
|
kennethre |
is binary |
05:28
🔗
|
tef |
the http message is treated as an octet-stream |
05:28
🔗
|
tef |
yes |
05:28
🔗
|
kennethre |
with utf8 splashed in |
05:28
🔗
|
kennethre |
gotcha |
05:28
🔗
|
kennethre |
perfect |
05:28
🔗
|
tef |
somewhat but you don't see it in practice |
05:28
🔗
|
tef |
technically headers can also be mime/quoted printable but I have never seen it |
05:29
🔗
|
tef |
so yeah - warc headers/values are nominally utf-8 but it's best to write ascii - so % encode the url in Target-URI - and the http message is treated as bytes |
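One way to follow the "write ascii, % encode the url" advice is `urllib.parse.quote` with the URL delimiters marked safe. This is a sketch under assumptions: `ascii_target_uri` is a hypothetical helper, and the choice of safe characters is mine rather than anything the WARC standard prescribes.

```python
from urllib.parse import quote


def ascii_target_uri(url):
    """Percent-encode non-ASCII and unsafe characters, leaving URL syntax intact."""
    # Reserved delimiters and existing %-escapes are kept as-is; everything
    # else is encoded as UTF-8 and then percent-escaped.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")
```

For example, `ascii_target_uri("http://example.com/caf\u00e9 menu?q=1")` gives `http://example.com/caf%C3%A9%20menu?q=1`. Non-ASCII host names are not handled here; those would need IDNA encoding instead.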
05:30
🔗
|
kennethre |
excellent |
05:30
🔗
|
tef |
and the newline for warc-records is CRLF |
05:30
🔗
|
kennethre |
sounds quite straightforward |
05:31
🔗
|
tef |
I have read the iso standard |
05:31
🔗
|
tef |
well parsing it is more annoying than writing nice ones :-) |
05:31
🔗
|
kennethre |
there should be a validator :) |
05:31
🔗
|
tef |
there is warctools |
05:31
🔗
|
kennethre |
i'm surprised you didn't just use HAL i think? |
05:31
🔗
|
kennethre |
what is it |
05:31
🔗
|
tef |
on pypi which has a warc2warc.py and a warcvalid.py |
05:31
🔗
|
kennethre |
the one chrome uses |
05:31
🔗
|
tef |
the chrome one has no headers |
05:31
🔗
|
kennethre |
HAR |
05:31
🔗
|
tef |
or didn't |
05:31
🔗
|
kennethre |
wtf really? |
05:32
🔗
|
tef |
this is the one heritrix produces, and people are using in practice in archives |
05:32
🔗
|
kennethre |
it does |
05:32
🔗
|
tef |
(warcs) |
05:32
🔗
|
kennethre |
but it's ugly |
05:32
🔗
|
kennethre |
yeah |
05:32
🔗
|
kennethre |
oh i know it's standard now |
05:32
🔗
|
tef |
warcs have a bunch of stuff for archivists (metadata, conversions, capture information) |
05:33
🔗
|
kennethre |
looks like HAR doesn't have the body!? |
05:33
🔗
|
tef |
there was one I saw which was essentially a http message in json to avoid having to parse http |
05:33
🔗
|
kennethre |
yes |
05:33
🔗
|
kennethre |
that's the one |
05:34
🔗
|
tef |
thing is, when we're doing the compliance thing we sort of tout the whole 'you can see what went over the wire' thing |
05:34
🔗
|
tef |
but really I didn't see the point much (also json unicode heh) |
05:34
🔗
|
tef |
sometimes the interesting bits of http messages are the raw encoding and how it's broken |
05:35
🔗
|
kennethre |
yet you don't store it transfer-encoded :) |
05:35
🔗
|
tef |
ssssh |
05:35
🔗
|
tef |
well we could but technically an upstream proxy is allowed to perform that transformation |
05:35
🔗
|
tef |
and if it breaks we leave the original in |
05:36
🔗
|
kennethre |
ah |
05:36
🔗
|
kennethre |
fair enough |
05:36
🔗
|
kennethre |
yep, that's what requests does too |
05:36
🔗
|
kennethre |
awesome |
05:36
🔗
|
kennethre |
this is going to be great |
05:37
🔗
|
tef |
well if only there was a collaborative editor for python online I would be so happy |
05:37
🔗
|
tef |
actually github does let me edit things in situ on the repo I just recalled |
05:37
🔗
|
tef |
anyway, I am wondering what I can start writing |
05:37
🔗
|
tef |
which might be of any help |
05:38
🔗
|
tef |
but mostly it seems I am being an oracle for the iso warc standard |
05:38
🔗
|
kennethre |
a feature wish list ;) |
05:38
🔗
|
tef |
ok |
05:44
🔗
|
tef |
https://notes.typo3.org/p/K2To4zZyGy |
05:45
🔗
|
tef |
doing that with emphasis on correct iso output |
05:45
🔗
|
tef |
ugh sometimes being a standards weenie is useful I guess |
05:48
🔗
|
kennethre |
hey, requests still doesn't do a POST/GET redirect |
05:48
🔗
|
kennethre |
rfc till death! |
05:52
🔗
|
tef |
also utc, utf-8 or death |
05:53
🔗
|
kennethre |
i like this guy |
05:53
🔗
|
tef |
I have a friend in localization/internationalization |
05:53
🔗
|
tef |
we have similar chants |
05:53
🔗
|
tef |
including iso date times or death |
05:54
🔗
|
tef |
unfortunately reality has a nasty way of making us deal with encoding issues |
06:07
🔗
|
tef |
anyway, that should be enough to go on kennethre ? |
06:13
🔗
|
tef |
https://notes.typo3.org/p/K2To4zZyGy |
06:15
🔗
|
kennethre |
tef: excellent, that should be perfect, thanks man :) |
06:15
🔗
|
kennethre |
tef: I shall commence :) |
06:16
🔗
|
tef |
cool |
06:17
🔗
|
tef |
I was this close to trying to add it myself |
06:17
🔗
|
tef |
well i've forked requests and checked it out |
06:17
🔗
|
tef |
hmm |
06:18
🔗
|
tef |
I can write something to do link extraction though |
06:26
🔗
|
tef |
holy shit |
06:26
🔗
|
tef |
so i've hacked up a crawler to use requests |
06:26
🔗
|
tef |
three line change |
06:27
🔗
|
tef |
https://github.com/tef/codesamples/tree/master/pyget |
06:30
🔗
|
tef |
https://github.com/tef/crawler even (now) |
06:33
🔗
|
kennethre |
haha excellent |
06:33
🔗
|
kennethre |
well that's simple enough |
06:35
🔗
|
tef |
yeah now it just needs warc output to be hacked in |
06:35
🔗
|
tef |
thing is, as much as it seems to be duplicating warc-wget I kind of appreciate it being in python as to be *configurable* |
06:36
🔗
|
tef |
now you just need to make it report to a generic tracker and upload directly to archive.org s3 :v |
06:36
🔗
|
tef |
s/you/me/etc/ |
06:36
🔗
|
tef |
btw: it made it faster, scarily faster |
06:37
🔗
|
kennethre |
hehe |
06:37
🔗
|
kennethre |
yeah there's no reason it shouldn't be in python really |
06:38
🔗
|
kennethre |
compiling warc-wget was far harder than installing a python dep |
06:38
🔗
|
kennethre |
and we can script that all out |
06:39
🔗
|
tef |
yeah |
06:39
🔗
|
tef |
well shall i add you to crawler ? |
06:40
🔗
|
tef |
it's not pretty but it is enough to consume requests and write warcs |
06:41
🔗
|
tef |
i'm just tidying up the html extractor to make it more wget-like |
06:43
🔗
|
kennethre |
awesome |
06:43
🔗
|
kennethre |
nah i'll make my thing |
06:43
🔗
|
kennethre |
and you can use it in your crawler |
06:46
🔗
|
tef |
awesome |
06:47
🔗
|
tef |
thing is I could always pull in another dep on warctools, unpack the raw bits from requests and dump them into a warc |
06:48
🔗
|
tef |
I don't mind either way |
06:48
🔗
|
tef |
although the way that involves less effort for me is somewhat preferable :v |
07:11
🔗
|
tef |
now I want to rewrite it entirely |
07:58
🔗
|
tef |
kennethre: fwiw I think it might be easier for me to just write warcs |
07:58
🔗
|
tef |
you've shown me the light with requests |
08:21
🔗
|
Coderjoe |
ugh |
08:21
🔗
|
Coderjoe |
h264 does not belong in avi |
08:23
🔗
|
Coderjoe |
(just ran across a movie on IA where the "Cinepack" file (which is their silly way of saying "avi", i guess) is h264+mp3 in avi) |
08:24
🔗
|
tef |
kennethre: one minor issue. there is no request.raw :3 |
08:49
🔗
|
tef |
well i've got it making pseudo warcs |
08:50
🔗
|
tef |
https://github.com/tef/crawler/tree/master/scraper |
08:51
🔗
|
tef |
but it has to reconstruct the http message from the request/response objects |
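Reconstructing the request head from its parts might look roughly like the sketch below. It is not the code in tef's crawler: `rebuild_request` is a hypothetical function, and the result is only an approximation of what went over the wire, since header order, casing, and any headers added by the HTTP library are not recoverable this way.

```python
from urllib.parse import urlsplit


def rebuild_request(method, url, headers):
    """Rebuild an approximate HTTP/1.1 request head as bytes.

    headers is a list of (name, value) string pairs.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = ["{} {} HTTP/1.1".format(method, target)]
    # Add a Host header only if the caller did not supply one.
    if not any(name.lower() == "host" for name, _ in headers):
        lines.append("Host: {}".format(parts.netloc))
    lines.extend("{}: {}".format(name, value) for name, value in headers)
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")
```

The bytes returned here could be used as the body of a WARC `request` record, with the caveat above noted in the record's metadata.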
09:56
🔗
|
dcmorton |
hey all, not sure if anybody is interested in this, but here's some stats on the VM that i've currently got downloading MobileMe.. http://networkwhisperer.com/cacti/ |
09:56
🔗
|
dcmorton |
been averaging about 250 gigs up/down a day |
10:30
🔗
|
Coderjoe |
uhoh |
10:30
🔗
|
Coderjoe |
http://www.theinquirer.net/inquirer/news/2144705/yahoo-sheds-chairman-directors |
10:31
🔗
|
Coderjoe |
another big change happening at the top over at yahoo |
10:32
🔗
|
ersi |
I'm axin' up axin' up, axin' up |
10:32
🔗
|
ersi |
'cause my shareholders taught me good |
10:59
🔗
|
chronomex |
lubin' up the axe |
11:07
🔗
|
Coderjoe |
hide your flickr, hide your ... is anything left of delicious? |
13:25
🔗
|
godane |
SketchCow: http://good.net/dl/bd |
13:26
🔗
|
godane |
this has a lot of videos from all the hacker cons |
13:26
🔗
|
godane |
and your bbs documentary |
13:27
🔗
|
godane |
it also has the convention.cdroms sections |
13:28
🔗
|
godane |
Also i was wondering |
13:30
🔗
|
godane |
does archive.org do some sort of deduplication? |
13:31
🔗
|
godane |
i just think it has to since people may upload the same video more than once |
13:33
🔗
|
ersi |
I bet they do deduplication and replication |
13:33
🔗
|
ersi |
I mean, considering the amounts of data they have/get |
15:42
🔗
|
Schbirid |
SketchCow: is David R. Foley on your list for the arcade docu? |
15:43
🔗
|
Schbirid |
emijrp: let's compare http://enjoys.it/jamendo/jamendo-archive-tcs_20120126.txt (some mb text) |
16:37
🔗
|
Schbirid |
shite, any one got an idea what happened to klov.com? |
16:38
🔗
|
Schbirid |
http://www.arcade-museum.com/ is dead for me |
17:30
🔗
|
yipdw |
ahahaha |
17:30
🔗
|
yipdw |
http://www.nytimes.com/2012/02/08/opinion/what-wikipedia-wont-tell-you.html?_r=1 |
17:31
🔗
|
yipdw |
it's cute how Cary Sherman assumes the US owns the Internet |
17:31
🔗
|
yipdw |
wait, no, it's not cute |
17:31
🔗
|
yipdw |
it's dangerously misguided |
17:46
🔗
|
emijrp |
US owns the Internet. |
17:50
🔗
|
emijrp |
US = Archive Team. Of course. |
22:13
🔗
|
arrith |
Magnet-hashes for all torrents on The Pirate Bay: 164 MB (thepiratebay.se) |
22:13
🔗
|
arrith |
http://news.ycombinator.com/item?id=3568393 |
22:13
🔗
|
arrith |
https://thepiratebay.se/torrent/7016365/The_whole_Pirate_Bay_magnet_archive |
22:13
🔗
|
arrith |
magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80 |
22:13
🔗
|
arrith |
from the description: "The only thing that's strange is that I found out only about 1.5 millions of torrents, while there is something about 4 millions of torrents in TPB footer. However, I think I am correct and TPB footer is not ;)" |
22:14
🔗
|
arrith |
not knowing his collection method of the magnet links it's hard to figure out where/if he went wrong without doing my own |
22:15
🔗
|
arrith |
actual content of the torrent is compressed and is only 90 MiB btw |
22:33
🔗
|
Nemo_bis |
arrith, I've also set up a cronjob for this: http://www.archive.org/details/publicbt.com |
22:33
🔗
|
Nemo_bis |
and that's 3 millions hashes |
22:34
🔗
|
Nemo_bis |
but doesn't contain titles |
22:34
🔗
|
arrith |
Nemo_bis: ah that's pretty fancy |
22:35
🔗
|
Nemo_bis |
well, nothing special |
22:35
🔗
|
arrith |
Nemo_bis: have you checked if any hashes get removed from newer grabs? |
22:35
🔗
|
Nemo_bis |
torrent sites are a quite stupid thing after all |
22:35
🔗
|
Nemo_bis |
arrith, obviously not! |
22:35
🔗
|
Nemo_bis |
that's something I want others to do |
22:35
🔗
|
arrith |
ah |
22:35
🔗
|
Nemo_bis |
otherwise I wouldn't upload it to IA :-p |
22:36
🔗
|
arrith |
heh |
22:36
🔗
|
Nemo_bis |
in a few years, researchers will have a lot of data to work on in that item :D |
22:36
🔗
|
arrith |
indeed |
22:36
🔗
|
arrith |
Nemo_bis: has what you have already grabbed been posted on IA? |
22:36
🔗
|
Nemo_bis |
arrith, what do you mean? |
22:37
🔗
|
arrith |
Nemo_bis: i got the impression you were or were planning to upload it to IA |
22:37
🔗
|
Nemo_bis |
also, you should upload that TPB archive to IA |
22:37
🔗
|
Nemo_bis |
arrith, I've already uploaded everything |
22:38
🔗
|
Nemo_bis |
it's just the list of hashes of PBt as published on their website |
22:38
🔗
|
arrith |
Nemo_bis: ah. IA makes stuff public eventually right? has to be curated or something first though i guess |
22:38
🔗
|
Nemo_bis |
it's already public, isn't it? |
22:38
🔗
|
Nemo_bis |
unless some sysadmin has "censored" it |
22:39
🔗
|
Nemo_bis |
there's nothing to hide there, it's just a list of hashes |
22:39
🔗
|
arrith |
the uploaded IA stuff or publicbt's archives? |
22:39
🔗
|
Nemo_bis |
doesn't mean anything in itself |
22:39
🔗
|
Nemo_bis |
that item |
22:39
🔗
|
Nemo_bis |
(as they explain in their home page) |
22:39
🔗
|
arrith |
since i'm pretty sure publicbt's stuff changes, so one would need to get backdated version |
22:39
🔗
|
arrith |
ah |
22:40
🔗
|
arrith |
Nemo_bis: well i'm not getting anything for http://www.archive.org/search.php?query=all.txt.bz2 |
22:40
🔗
|
Nemo_bis |
you could use those hashes to get all torrents or all info about what they actually contain, but that's not something I'm going to do :D |
22:40
🔗
|
Nemo_bis |
I do only the easy stuff |
22:41
🔗
|
arrith |
right |
22:41
🔗
|
Nemo_bis |
arrith, I doubt you can search filenames |
22:41
🔗
|
arrith |
Nemo_bis: any idea how one might find them? |
22:41
🔗
|
Nemo_bis |
arrith, find what? |
22:41
🔗
|
Nemo_bis |
brb |
22:43
🔗
|
DFJustin |
http://ia700807.us.archive.org/11/items/publicbt.com/ |
22:44
🔗
|
arrith |
nice |
22:44
🔗
|
arrith |
ty DFJustin |
22:44
🔗
|
DFJustin |
Nemo_bis: set the item mediatype to "data" or "web" so it shows the file links |
22:45
🔗
|
Nemo_bis |
DFJustin, am I allowed to? |
22:45
🔗
|
Nemo_bis |
isn't that for privileged users |
22:46
🔗
|
DFJustin |
the permissions are actually available for anyone, just the web interface doesn't let you |
22:47
🔗
|
DFJustin |
you can using s3 http://www.archive.org/help/abouts3.txt |
22:47
🔗
|
Nemo_bis |
hm |
22:47
🔗
|
DFJustin |
or I just use firebug to take the readonly attribute off the textbox |
22:47
🔗
|
DFJustin |
:D |
22:47
🔗
|
Nemo_bis |
last time I tried, I failed |
22:47
🔗
|
Nemo_bis |
perhaps because I tried to upload to some other collection |
22:47
🔗
|
DFJustin |
yeah the collection stuff is locked down |
22:47
🔗
|
Nemo_bis |
do you mean, change mediatype and leave in that collection |
22:47
🔗
|
Nemo_bis |
ah ok |
22:49
🔗
|
Nemo_bis |
DFJustin, do I have to respecify all metadata? |
22:49
🔗
|
DFJustin |
dunno it kind of sounds like it but I haven't tried on an existing item |
22:49
🔗
|
arrith |
heh have to tweak their page to get stuff to display. i wonder why they have it set locked down like that |
22:50
🔗
|
Nemo_bis |
which page? |
22:50
🔗
|
DFJustin |
for the text mediatype it makes sense since it shows the page reader interface and stuff, but if you upload something that's not a book, it doesn't convert and thus you see nothing |
22:51
🔗
|
Nemo_bis |
DFJustin, I never manage to follow the example in the doc |
22:52
🔗
|
Nemo_bis |
I get an HTML page with "A request of the requested method PUT requires a valid Content-length." |
22:52
🔗
|
arrith |
yipdw: when you're around: can you think of any better way to get all magnet links on a torrent site, say the pirate bay, without basically grabbing the full html of each page? |
22:52
🔗
|
arrith |
also anyone else with thoughts on that ^ |
22:53
🔗
|
Nemo_bis |
info_hash is often exposed on the page to let search engines index it |
22:53
🔗
|
DFJustin |
the collection stuff puzzles me though, they have stuff like http://www.archive.org/details/open_source_software that they don't actually allow the unwashed to upload to (although the gatekeeping is sufficiently lax to allow random metadata-less arabic stuff) |
22:53
🔗
|
Nemo_bis |
who can upload there? |
22:54
🔗
|
DFJustin |
I think they have to grant permission on an account-by-account basis |
22:54
🔗
|
arrith |
Nemo_bis: ah right, for other sites yeah. though i'm not sure tpb exposes the info_hash on pages. if it is it's well hidden and google can't see it since i've never gotten a tpb result for googling a userhash |
22:55
🔗
|
DFJustin |
I tried emailing about it a while back but got a useless form letter |
22:55
🔗
|
Nemo_bis |
heh |
22:55
🔗
|
Nemo_bis |
well, I just put tags which make sense |
22:56
🔗
|
Nemo_bis |
and when there's too much stuff to let around, someone with permissions gets sick of it and moves to the correct collection |
22:56
🔗
|
Nemo_bis |
not my problem |
22:56
🔗
|
DFJustin |
like, somehow this guy has the magic bits http://www.archive.org/details/homaled |
22:57
🔗
|
Nemo_bis |
... |
22:58
🔗
|
yipdw |
arrith: unless they expose it some other way, grabbing the HTML of each page is the best you can do |
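If the fallback is to grab each page's HTML, pulling the magnet links out of it can be done with the standard library. A minimal sketch, assuming the links appear as ordinary `<a href="magnet:?...">` anchors; the class and function names are hypothetical, and fetching the pages is left out.

```python
from html.parser import HTMLParser


class MagnetLinkParser(HTMLParser):
    """Collect href values of <a> tags that are magnet links."""

    def __init__(self):
        super().__init__()
        self.magnets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Attribute values arrive with character references already decoded.
            if name == "href" and value and value.startswith("magnet:?"):
                self.magnets.append(value)


def extract_magnets(html):
    """Return all magnet links found in an HTML string, in document order."""
    parser = MagnetLinkParser()
    parser.feed(html)
    return parser.magnets
```

Usage would be something like `extract_magnets(page_source)` for each fetched page, with the results appended to a list or file.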
22:58
🔗
|
DFJustin |
anyway you can just poke jason when he comes by |
22:59
🔗
|
arrith |
yipdw: yeah i guess so |
22:59
🔗
|
yipdw |
I'm not familiar with what services TPB provides |
22:59
🔗
|
yipdw |
re: magnet URL tracking |
22:59
🔗
|
yipdw |
unfortunately |
22:59
🔗
|
yipdw |
so nothing is coming to mind atm |
22:59
🔗
|
yipdw |
and I don't want to access TPB from work :P |
22:59
🔗
|
arrith |
yipdw: they used to have a tracker but yeah afaik they currently serve up torrent files and list magnet data |
22:59
🔗
|
arrith |
yipdw: haha. np |
23:01
🔗
|
emijrp |
the biggest torrent in that archive is the geocities one |
23:01
🔗
|
emijrp |
the PATCHED one |
23:04
🔗
|
arrith |
neat |
23:04
🔗
|
arrith |
emijrp, balrog: btw seems you both were hit by a netsplit and may want to check the public logs if you're interested in magnet link archive stuff |
23:05
🔗
|
balrog |
arrith: it wasn't a netsplit here |
23:05
🔗
|
balrog |
my phone battery gave out due to cold weather :p |
23:06
🔗
|
arrith |
ouch |
23:07
🔗
|
balrog |
went from 12% to 0, like that |
23:53
🔗
|
emijrp |
arrith: http://www.archiveteam.org/index.php?title=The_Pirate_Bay |
23:55
🔗
|
arrith |
emijrp: ty |
23:56
🔗
|
emijrp |
time to update |