#archiveteam 2012-02-08,Wed

Time Nickname Message
01:16 🔗 godane SketchCow: I'm backing up defcon.org
01:17 🔗 godane it's only 3.2 GB
01:17 🔗 godane i'm not backing up the subdomains of defcon.org like media.defcon.org
04:12 🔗 yipdw heh
04:12 🔗 yipdw 7008 yipdw 20 0 380m 81m 11m S 390 0.7 10:37.05 rbx
04:12 🔗 yipdw that's what I like to see, 390% CPU utilization from Ruby
04:12 🔗 yipdw almost all cores in use :D
04:13 🔗 chronomex awwwwyeah
04:32 🔗 tef yipdw: what are you using to archive it ?
04:33 🔗 yipdw tef: archive what?
04:33 🔗 tef ah shit I misread scrollback
04:33 🔗 tef was assuming that was what ruby was up to
04:33 🔗 tef christ how did I get ops?
04:33 🔗 yipdw oh
04:33 🔗 yipdw heh
04:34 🔗 yipdw no, it's crawling a graph
04:35 🔗 Coderjoe 149,566,088 bytes before, 215,611,184 after
04:35 🔗 Coderjoe you've nearly doubled the redis memory usage
04:35 🔗 yipdw I stopped writing to it
04:36 🔗 Coderjoe oh
04:36 🔗 yipdw I'm using my home machine right now because it can give me better performance
04:36 🔗 yipdw at some point I'll resync and move the crawler back
04:37 🔗 yipdw and by "better" I mean that I can get on the order of 1k work items/sec processed
04:37 🔗 yipdw the m1.small EC2 instance was doing like 100
04:46 🔗 Coderjoe mmm
04:47 🔗 Coderjoe someone just tried probing my ec2 instance's ssh port
04:51 🔗 kennethre Ah I've missed you all
04:51 🔗 tef hullo
04:51 🔗 chronomex hrm?
04:53 🔗 kennethre tef: so do you think any changes would need to be made to requests?
04:53 🔗 tef well the only thing I can think of is to pass in something to the session, or making a session wrapper in warctools
04:54 🔗 tef personally i'd dump a callback or a method in session to capture raw data, and either pass it in as an argument or subclass it
04:55 🔗 tef so then it's just a def hook(request, response): write_warc(request.raw, response.raw)
04:55 🔗 tef I don't think this is a common use case
04:55 🔗 tef so i'd expect people to have a request dependency from using warctools.requests (say), rather than the other way around
04:56 🔗 tef on the other hand, having a generic hook would be nice for logging/debugging
04:56 🔗 tef a trace handler, really
04:56 🔗 tef I don't know the requests api well enough to make a concrete suggestion
04:57 🔗 kennethre tef: that sounds overly complciated: )
04:57 🔗 kennethre *complicated
04:57 🔗 kennethre keep it simple
04:57 🔗 tef ok
04:58 🔗 tef add a new event hook? that is fired with the request, response pair ?
04:58 🔗 kennethre i don't understand why that's necessary
04:58 🔗 kennethre a warc records a single request right? or a bunch in one file?
04:59 🔗 tef well the thing is to do it we have to pair off the request/response records in the warc
04:59 🔗 tef with the WARC-Concurrent-To: header
04:59 🔗 tef and *technically* you need to know the request to parse the response
04:59 🔗 tef (because of HEAD requests)
04:59 🔗 tef but you can ignore it for nearly everything because they're rare in practice
05:00 🔗 kennethre what does WARC-Concurrent-To do?
05:00 🔗 tef but we need to know which request goes with which response
05:00 🔗 kennethre i still don't see how this is an issue :)
05:00 🔗 tef http://secretvolcanobase.org/~tef/boingboing_sopa.warc
05:00 🔗 tef if you look at the warc record
05:01 🔗 tef I write the id of the request record in the response concurrent-to header
05:01 🔗 tef and vice versa
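
The pairing tef describes can be sketched roughly as below. This is only an illustration, not code from warctools: the helper names `new_record_id` and `paired_headers` are made up here, and a real record needs more headers (WARC-Date, Content-Type, Content-Length) than are shown.

```python
import uuid


def new_record_id():
    # Record IDs are URIs; a urn:uuid in angle brackets is a common choice.
    return "<urn:uuid:%s>" % uuid.uuid4()


def paired_headers(target_uri):
    """Return (request_headers, response_headers) that reference each other."""
    request_id = new_record_id()
    response_id = new_record_id()
    request_headers = {
        "WARC-Type": "request",
        "WARC-Record-ID": request_id,
        "WARC-Target-URI": target_uri,
        # The request names the response it belongs with ...
        "WARC-Concurrent-To": response_id,
    }
    response_headers = {
        "WARC-Type": "response",
        "WARC-Record-ID": response_id,
        "WARC-Target-URI": target_uri,
        # ... and vice versa.
        "WARC-Concurrent-To": request_id,
    }
    return request_headers, response_headers
```

With the IDs cross-referenced like this, the two records no longer have to sit back to back in the file, which is the interleaving point Coderjoe raises a few lines later.
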
05:01 🔗 kennethre yes, i see a request and response
05:01 🔗 tef so with the current event hooks, how would I match them up ?
05:02 🔗 kennethre i still don't see how this is special
05:02 🔗 kennethre i'm just not seeing something obvious or something :)
05:05 🔗 kennethre you make a request, you get a response
05:05 🔗 kennethre there's no disconnect
05:05 🔗 Coderjoe without it, the request and response to that request would have to be back to back. with it, you can interleave
05:05 🔗 Coderjoe if more than one request were to be made at the same time (or before the previous finished)
05:06 🔗 kennethre you mean http/1.1 pipelining or app-level connection pooling?
05:08 🔗 Coderjoe for example, a web browser is generally permitted to open two tcp connections to the server at the same time
05:08 🔗 Coderjoe and issue requests concurrently on them
05:08 🔗 kennethre yeah
05:08 🔗 kennethre you could open 5000 with requests
05:09 🔗 kennethre you don't not have a reference to the request though
05:09 🔗 kennethre it's a non-issue
05:09 🔗 kennethre responses have a request attribute
05:09 🔗 tef argh wireless
05:09 🔗 kennethre even if it didn't though, it wouldn't be an issue
05:12 🔗 tef kennethre: no I mean, if you passed in two event hooks, one gets the request, one gets the response
05:12 🔗 tef if you had multiple things or threads calling on the same session how do I know they match up
05:12 🔗 tef ah shit
05:12 🔗 tef if you had multiple things or threads calling on the same session how do I know they match up
05:13 🔗 tef kennethre: oh doh
05:13 🔗 tef I see
05:13 🔗 kennethre this is super simple
05:13 🔗 kennethre sessions are thread safe
05:13 🔗 tef 05:09 < kennethre> responses have a request attribute
05:13 🔗 kennethre even if they didn't
05:13 🔗 kennethre it wouldn't be an issue :)
05:13 🔗 tef yes
05:14 🔗 tef that was the bit I was missing
05:14 🔗 tef but hey, it's 5am :-)
05:14 🔗 kennethre haha
05:14 🔗 tef although I did wake up at midday....
05:14 🔗 kennethre how's this
05:14 🔗 kennethre i feel like i could write this thing in 2 hours
05:14 🔗 tef so yeah I just write a nice hook
05:14 🔗 kennethre no hook
05:14 🔗 kennethre wrap
05:14 🔗 kennethre out of band :)
05:14 🔗 kennethre no need to mess with internals
05:15 🔗 tef well I was figuring a wrapper that passed in a hook :v but I guess wrapping it also works
05:15 🔗 tef and is probably simpler
05:15 🔗 tef I think I got labeled the architect because I over engineer things, or that I'm terrible at testing. It's a job title of shame.
05:15 🔗 kennethre you make a request, record what you need, get the response, record what you need
05:16 🔗 kennethre put them in a giant pool if you need, out comes a warc
05:16 🔗 kennethre i can even add this into requests itself
05:16 🔗 kennethre i've thought about it
05:16 🔗 kennethre response.warc
05:16 🔗 kennethre tef: hehe, I *hate* overengineering :)
05:16 🔗 tef kennethre: doesn't everyone :-)
05:16 🔗 kennethre it's my mission in life to abolish it
05:16 🔗 kennethre esp in python :)
05:17 🔗 tef well it's my mission in life to avoid it, it's effort
05:17 🔗 kennethre so a warc file could have 300 requests in it, right?
05:17 🔗 tef well 300 request/responses
05:17 🔗 kennethre (obviously)
05:18 🔗 tef technically a warc file starts with a warcinfo record with an anvl separated (read key:value\r\n iirc) format
05:18 🔗 kennethre and if a server responds with a transfer encoding
05:18 🔗 kennethre like gzip, does it ungzip it first?
05:18 🔗 tef well, there are *two things* you can do
05:18 🔗 tef you can just ungzip it and unchunk it (we do that because other software is shitty at parsing)
05:18 🔗 tef or you can write a conversion record that is concurrent to the request/response record
05:18 🔗 tef which can contain the original or the converted one .
05:19 🔗 tef i'd probably have to doublecheck on that restriction but
05:19 🔗 kennethre i'll have to look at the spec
05:19 🔗 tef I think archive.org hates gzip encoding
05:19 🔗 kennethre but i feel like i could do this pretty easily
05:19 🔗 tef kennethre: it's an iso standard, but 1.00 is basically 0.17
05:19 🔗 kennethre even add some nice gevent magic
05:19 🔗 tef yeah the thing is I realized it was easy but I figured i'd ask
05:20 🔗 kennethre i guess it needs to crawl too, though?
05:20 🔗 tef well I have some code to do basic crawling somewhere
05:20 🔗 tef I wrote it in an afternoon as a code sample so it's terrible
05:20 🔗 kennethre tef: i'd love to collaborate on this if you're in the mood for such a thing ;)
05:20 🔗 tef sure
05:20 🔗 kennethre tef: I've been meaning to get more involved with archiveteam
05:21 🔗 tef I was gonna just hack this shit up in warctools and push it and hope someone here finds it useful to augment the warc-get bulldozers
05:21 🔗 kennethre I'm in :)
05:21 🔗 tef thing is, writing warcs is *relatively* straightforward
05:21 🔗 kennethre yeah
05:21 🔗 tef you could probably do it yourself pretty easily
05:21 🔗 kennethre yeah :)
05:22 🔗 tef it's basically VERSION CRLF HEADERS* CRLF BODY CRLF
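
A minimal sketch of that layout, assuming headers arrive as (name, value) pairs. The function name `serialize_record` is hypothetical. Note that the ISO text ends each record with two CRLFs after the block, where the one-line summary above shows one, and that Content-Length is a mandatory header, so the sketch adds it.

```python
def serialize_record(headers, body, version=b"WARC/1.0"):
    """Serialize one WARC record: version line, headers, blank line, block.

    headers is a list of (name, value) string pairs; body is bytes.
    """
    lines = [version]
    for name, value in headers:
        lines.append(("%s: %s" % (name, value)).encode("utf-8"))
    # Content-Length counts the bytes of the block only.
    lines.append(b"Content-Length: %d" % len(body))
    head = b"\r\n".join(lines) + b"\r\n"
    # Blank line ends the headers; two CRLFs end the record.
    return head + b"\r\n" + body + b"\r\n\r\n"
```

For a request or response record the body would be the raw HTTP message bytes, with Content-Type set to application/http plus a msgtype parameter.
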
05:22 🔗 kennethre if there's no aversion to dependencies, we could make it crazy awesome w/ gevent
05:22 🔗 tef I have to do error correcting parsing on a bunch of weird formats
05:22 🔗 kennethre i'll go to 11 :)
05:22 🔗 tef well I have no problem with them
05:23 🔗 tef but http://code.hanzoarchives.com/warc-tools/src/58d7d99406b0/hanzo/warctools/warc.py#cl-51 (btw now MIT) is how we write them
05:23 🔗 tef the boingboing warc is pretty much all you need beyond a warcinfo record
05:24 🔗 tef if you push something to git I can flesh out the warcwriting to make it standards compliant
05:24 🔗 tef they also like to be gzipped, record by record and then catted together
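
The "gzipped record by record and then catted together" convention could look something like this. It is a sketch using only the standard library, and `gzip_records` is a name invented for the example.

```python
import gzip


def gzip_records(records):
    """Compress each serialized record as its own gzip member and concatenate.

    records is an iterable of bytes objects, one per WARC record.
    """
    return b"".join(gzip.compress(record) for record in records)
```

Because every record is a separate gzip member, a reader that knows a record's byte offset can decompress just that record, while decompressing the whole file still yields the records in order.
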
05:25 🔗 tef my github is tef fwiw
05:26 🔗 tef cos I am totally in a mood to code before bedtime, and we're at the edge of a release cycle so I have to not break work code today
05:26 🔗 kennethre awesome
05:26 🔗 kennethre I'm a member of the archiveteam org
05:26 🔗 kennethre might push something up there
05:26 🔗 kennethre we'll see
05:26 🔗 kennethre depends on my mood
05:26 🔗 kennethre i have 800 projects going on right now
05:26 🔗 tef heh
05:26 🔗 kennethre this one sounds nice and quick though, so i might kick it out ;)
05:27 🔗 tef well I am interested in making requests + warc output happen
05:27 🔗 kennethre are warcs binary?
05:27 🔗 kennethre utf8? any encoding
05:27 🔗 tef well, technically the body is in binary
05:27 🔗 tef the headers *can* be utf-8
05:27 🔗 kennethre headers are supposed to be latin1
05:27 🔗 tef but most of the values tend to be ascii anyway
05:27 🔗 kennethre grr
05:27 🔗 tef the headers of a warc file can be utf-8
05:27 🔗 kennethre but the file itself
05:28 🔗 kennethre is binary
05:28 🔗 tef the http message is treated as an octet-stream
05:28 🔗 tef yes
05:28 🔗 kennethre with utf8 splashed in
05:28 🔗 kennethre gotcha
05:28 🔗 kennethre perfect
05:28 🔗 tef somewhat but you don't see it in practice
05:28 🔗 tef technically headers can also be mime/quoted printable but I have never seen it
05:29 🔗 tef so yeah - warc headers/values are nominally utf-8 but it's best to write ascii - so % encode the url in Target-URI - and the http message is treated as bytes
05:30 🔗 kennethre excellent
05:30 🔗 tef and the newline for warc-records is CRLF
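
One way to follow the "write ascii, % encode the url" advice, sketched with the standard library. `ascii_target_uri` is a made-up helper, and it makes two assumptions: existing %-escapes are left alone, and non-ASCII hostnames (which would need IDNA encoding rather than percent-escapes) are not handled.

```python
from urllib.parse import quote


def ascii_target_uri(url):
    """Percent-encode anything outside the URL-safe ASCII set.

    Reserved delimiters and existing %-escapes are kept as they are;
    other characters are escaped as their UTF-8 bytes.
    """
    return quote(url, safe=":/?#[]@!$&'()*+,;=%~")


def target_uri_header(url):
    # WARC header lines end in CRLF; the result is pure ASCII bytes.
    line = "WARC-Target-URI: %s\r\n" % ascii_target_uri(url)
    return line.encode("ascii")
```
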
05:30 🔗 kennethre sounds quite straightforward
05:31 🔗 tef I have read the iso standard
05:31 🔗 tef well parsing it is more annoying than writing nice ones :-)
05:31 🔗 kennethre there should be a validator :)
05:31 🔗 tef there is warctools
05:31 🔗 kennethre i'm surprised you didn't just use HAL i think?
05:31 🔗 kennethre what is it
05:31 🔗 tef on pypi which has a warc2warc.py and a warcvalid.py
05:31 🔗 kennethre the one chrome uses
05:31 🔗 tef the chrome one has no headers
05:31 🔗 kennethre HAR
05:31 🔗 tef or didn't
05:31 🔗 kennethre wtf really?
05:32 🔗 tef this is the one heritrix produces, and people are using in practice in archives
05:32 🔗 kennethre it does
05:32 🔗 tef (warcs)
05:32 🔗 kennethre but it's ugly
05:32 🔗 kennethre yeah
05:32 🔗 kennethre oh i know it's standard now
05:32 🔗 tef warcs have a bunch of stuff for archivists (metadata, conversions, capture information)
05:33 🔗 kennethre looks like HAR doesn't have the body!?
05:33 🔗 tef there was one I saw which was essentially a http message in json to avoid having to parse http
05:33 🔗 kennethre yes
05:33 🔗 kennethre that's the one
05:34 🔗 tef thing is, when we're doing the compliance thing we sort of tout the whole 'you can see what went over the wire' thing
05:34 🔗 tef but really I didn't see the point much (also json unicode heh)
05:34 🔗 tef sometimes the interesting bits of http messages are the raw encoding and how it's broken
05:35 🔗 kennethre yet you don't store it transfer-encoded :)
05:35 🔗 tef ssssh
05:35 🔗 tef well we could but technically an upstream proxy is allowed to perform that transformation
05:35 🔗 tef and if it breaks we leave the original in
05:36 🔗 kennethre ah
05:36 🔗 kennethre fair enough
05:36 🔗 kennethre yep, that's what requests does too
05:36 🔗 kennethre awesome
05:36 🔗 kennethre this is going to be great
05:37 🔗 tef well if only there was a collaborative editor for python online I would be so happy
05:37 🔗 tef actually github does let me edit things in situ on the repo I just recalled
05:37 🔗 tef anyway, I am wondering what I can start writing
05:37 🔗 tef which might be of any help
05:38 🔗 tef but mostly it seems I am being an oracle for the iso warc standard
05:38 🔗 kennethre a feature wish list ;)
05:38 🔗 tef ok
05:44 🔗 tef https://notes.typo3.org/p/K2To4zZyGy
05:45 🔗 tef doing that with emphasis on correct iso output
05:45 🔗 tef ugh sometimes being a standards weenie is useful I guess
05:48 🔗 kennethre hey, requests still doesn't do a POST/GET redirect
05:48 🔗 kennethre rfc till death!
05:52 🔗 tef also utc, utf-8 or death
05:53 🔗 kennethre i like this guy
05:53 🔗 tef I have a friend in localization/internationalization
05:53 🔗 tef we have similar chants
05:53 🔗 tef including iso date times or death
05:54 🔗 tef unfortunately reality has a nasty way of making us deal with encoding issues
06:07 🔗 tef anyway, that should be enough to go on kennethre ?
06:13 🔗 tef https://notes.typo3.org/p/K2To4zZyGy
06:15 🔗 kennethre tef: excellent, that should be perfect, thanks man :)
06:15 🔗 kennethre tef: I shall commence :)
06:16 🔗 tef cool
06:17 🔗 tef I was this close to trying to add it myself
06:17 🔗 tef well i've forked requests and checked it out
06:17 🔗 tef hmm
06:18 🔗 tef I can write something to do link extraction though
06:26 🔗 tef holy shit
06:26 🔗 tef so i've hacked up a crawler to use requests
06:26 🔗 tef three line change
06:27 🔗 tef https://github.com/tef/codesamples/tree/master/pyget
06:30 🔗 tef https://github.com/tef/crawler even (now)
06:33 🔗 kennethre haha excellent
06:33 🔗 kennethre well that's simple enough
06:35 🔗 tef yeah now it just needs warc output to be hacked in
06:35 🔗 tef thing is, as much as it seems to be duplicating warc-wget I kind of appreciate it being in python as to be *configurable*
06:36 🔗 tef now you just need to make it report to a generic tracker and upload directly to archive.org s3 :v
06:36 🔗 tef s/you/me/etc/
06:36 🔗 tef btw: it made it faster, scarily faster
06:37 🔗 kennethre hehe
06:37 🔗 kennethre yeah there's no reason it shouldn't be in python really
06:38 🔗 kennethre compiling warc-wget was far harder than installing a python dep
06:38 🔗 kennethre and we can script that all out
06:39 🔗 tef yeah
06:39 🔗 tef well shall i add you to crawler ?
06:40 🔗 tef it's not pretty but it is enough to consume requests and write warcs
06:41 🔗 tef i'm just tidying up the html extractor to make it more wget-like
06:43 🔗 kennethre awesome
06:43 🔗 kennethre nah i'll make my thing
06:43 🔗 kennethre and you can use it in your crawler
06:46 🔗 tef awesome
06:47 🔗 tef thing is I could always pull in another dep on warctools, unpack the raw bits from requests and dump them into a warc
06:48 🔗 tef I don't mind either way
06:48 🔗 tef although the way that involves less effort for me is somewhat preferable :v
07:11 🔗 tef now I want to rewrite it entirely
07:58 🔗 tef kennethre: fwiw I think it might be easier for me to just write warcs
07:58 🔗 tef you've shown me the light with requests
08:21 🔗 Coderjoe ugh
08:21 🔗 Coderjoe h264 does not belong in avi
08:23 🔗 Coderjoe (just ran across a movie on IA where the "Cinepack" file (which is their silly way of saying "avi", i guess) is h264+mp3 in avi)
08:24 🔗 tef kennethre: one minor issue. there is no request.raw :3
08:49 🔗 tef well i've got it making pseudo warcs
08:50 🔗 tef https://github.com/tef/crawler/tree/master/scraper
08:51 🔗 tef but it has to reconstruct the http message from the request/response objects
09:56 🔗 dcmorton hey all, not sure if anybody is interested in this, but here's some stats on the VM that i've currently got downloading MobileMe.. http://networkwhisperer.com/cacti/
09:56 🔗 dcmorton been averaging about 250 gigs up/down a day
10:30 🔗 Coderjoe uhoh
10:30 🔗 Coderjoe http://www.theinquirer.net/inquirer/news/2144705/yahoo-sheds-chairman-directors
10:31 🔗 Coderjoe another big change happening at the top over at yahoo
10:32 🔗 ersi I'm axin' up axin' up, axin' up
10:32 🔗 ersi 'cause my shareholders taught me good
10:59 🔗 chronomex lubin' up the axe
11:07 🔗 Coderjoe hide your flickr, hide your ... is anything left of delicious?
13:25 🔗 godane SketchCow: http://good.net/dl/bd
13:26 🔗 godane this has a lot of videos from all the hacker cons
13:26 🔗 godane and your bbs documentary
13:27 🔗 godane it also has the convention.cdroms sections
13:28 🔗 godane Also i was wondering
13:30 🔗 godane does archive.org do some sort of deduplication?
13:31 🔗 godane i just think it has to since people may upload the same video more than once
13:33 🔗 ersi I bet they do deduplication and replication
13:33 🔗 ersi I mean, considering the amounts of data they have/get
15:42 🔗 Schbirid SketchCow: is David R. Foley on your list for the arcade docu?
15:43 🔗 Schbirid emijrp: let's compare http://enjoys.it/jamendo/jamendo-archive-tcs_20120126.txt (some mb text)
16:37 🔗 Schbirid shite, any one got an idea what happened to klov.com?
16:38 🔗 Schbirid http://www.arcade-museum.com/ is dead for me
17:30 🔗 yipdw ahahaha
17:30 🔗 yipdw http://www.nytimes.com/2012/02/08/opinion/what-wikipedia-wont-tell-you.html?_r=1
17:31 🔗 yipdw it's cute how Cary Sherman assumes the US owns the Internet
17:31 🔗 yipdw wait, no, it's not cute
17:31 🔗 yipdw it's dangerously misguided
17:46 🔗 emijrp US owns the Internet.
17:50 🔗 emijrp US = Archive Team. Of course.
22:13 🔗 arrith Magnet-hashes for all torrents on The Pirate Bay: 164 MB (thepiratebay.se)
22:13 🔗 arrith http://news.ycombinator.com/item?id=3568393
22:13 🔗 arrith https://thepiratebay.se/torrent/7016365/The_whole_Pirate_Bay_magnet_archive
22:13 🔗 arrith magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80
22:13 🔗 arrith from the description: "The only thing that's strange is that I found out only about 1.5 millions of torrents, while there is something about 4 millions of torrents in TPB footer. However, I think I am correct and TPB footer is not ;)"
22:14 🔗 arrith not knowing his collection method of the magnet links it's hard to figure out where/if he went wrong without doing my own
22:15 🔗 arrith actual content of the torrent is compressed and is only 90 MiB btw
22:33 🔗 Nemo_bis arrith, I've also set up a cronjob for this: http://www.archive.org/details/publicbt.com
22:33 🔗 Nemo_bis and that's 3 million hashes
22:34 🔗 Nemo_bis but doesn't contain titles
22:34 🔗 arrith Nemo_bis: ah that's pretty fancy
22:35 🔗 Nemo_bis well, nothing special
22:35 🔗 arrith Nemo_bis: have you checked if any hashes get removed from newer grabs?
22:35 🔗 Nemo_bis torrent sites are a quite stupid thing after all
22:35 🔗 Nemo_bis arrith, obviously not!
22:35 🔗 Nemo_bis that's something I want others to do
22:35 🔗 arrith ah
22:35 🔗 Nemo_bis otherwise I wouldn't upload it to IA :-p
22:36 🔗 arrith heh
22:36 🔗 Nemo_bis in a few years, researchers will have a lot of data to work on in that item :D
22:36 🔗 arrith indeed
22:36 🔗 arrith Nemo_bis: has what you have already grabbed been posted on IA?
22:36 🔗 Nemo_bis arrith, what do you mean?
22:37 🔗 arrith Nemo_bis: i got the impression you were or were planning to upload it to IA
22:37 🔗 Nemo_bis also, you should upload that TPB archive to IA
22:37 🔗 Nemo_bis arrith, I've already uploaded everything
22:38 🔗 Nemo_bis it's just the list of hashes of PBt as published on their website
22:38 🔗 arrith Nemo_bis: ah. IA makes stuff public eventually right? has to be curated or something first though i guess
22:38 🔗 Nemo_bis it's already public, isn't it?
22:38 🔗 Nemo_bis unless some sysadmin has "censored" it
22:39 🔗 Nemo_bis there's nothing to hide there, it's just a list of hashes
22:39 🔗 arrith the uploaded IA stuff or publicbt's archives?
22:39 🔗 Nemo_bis doesn't mean anything in itself
22:39 🔗 Nemo_bis that item
22:39 🔗 Nemo_bis (as they explain in their home page)
22:39 🔗 arrith since i'm pretty sure publicbt's stuff changes, so one would need to get backdated version
22:39 🔗 arrith ah
22:40 🔗 arrith Nemo_bis: well i'm not getting anything for http://www.archive.org/search.php?query=all.txt.bz2
22:40 🔗 Nemo_bis you could use those hashes to get all torrents or all info about what they actually contain, but that's not something I'm going to do :D
22:40 🔗 Nemo_bis I do only the easy stuff
22:41 🔗 arrith right
22:41 🔗 Nemo_bis arrith, I doubt you can search filenames
22:41 🔗 arrith Nemo_bis: any idea how one might find them?
22:41 🔗 Nemo_bis arrith, find what?
22:41 🔗 Nemo_bis brb
22:43 🔗 DFJustin http://ia700807.us.archive.org/11/items/publicbt.com/
22:44 🔗 arrith nice
22:44 🔗 arrith ty DFJustin
22:44 🔗 DFJustin Nemo_bis: set the item mediatype to "data" or "web" so it shows the file links
22:45 🔗 Nemo_bis DFJustin, am I allowed to?
22:45 🔗 Nemo_bis isn't that for privileged users
22:46 🔗 DFJustin the permissions are actually available for anyone, just the web interface doesn't let you
22:47 🔗 DFJustin you can using s3 http://www.archive.org/help/abouts3.txt
22:47 🔗 Nemo_bis hm
22:47 🔗 DFJustin or I just use firebug to take the readonly attribute off the textbox
22:47 🔗 DFJustin :D
22:47 🔗 Nemo_bis last time I tried, I failed
22:47 🔗 Nemo_bis perhaps because I tried to upload to some other collection
22:47 🔗 DFJustin yeah the collection stuff is locked down
22:47 🔗 Nemo_bis do you mean, change mediatype and leave in that collection
22:47 🔗 Nemo_bis ah ok
22:49 🔗 Nemo_bis DFJustin, do I have to respecify all metadata?
22:49 🔗 DFJustin dunno it kind of sounds like it but I haven't tried on an existing item
22:49 🔗 arrith heh have to tweak their page to get stuff to display. i wonder why they have it set locked down like that
22:50 🔗 Nemo_bis which page?
22:50 🔗 DFJustin for the text mediatype it makes sense since it shows the page reader interface and stuff, but if you upload something that's not a book, it doesn't convert and thus you see nothing
22:51 🔗 Nemo_bis DFJustin, I never manage to follow the example in the doc
22:52 🔗 Nemo_bis I get an HTML page with "A request of the requested method PUT requires a valid Content-length."
22:52 🔗 arrith yipdw: when you're around: can you think of any better way to get all magnet links on a torrent site, say the pirate bay, without basically grabbing the full html of each page?
22:52 🔗 arrith also anyone else with thoughts on that ^
22:53 🔗 Nemo_bis info_hash is often exposed on the page to let search engines index it
22:53 🔗 DFJustin the collection stuff puzzles me though, they have stuff like http://www.archive.org/details/open_source_software that they don't actually allow the unwashed to upload to (although the gatekeeping is sufficiently lax to allow random metadata-less arabic stuff)
22:53 🔗 Nemo_bis who can upload there?
22:54 🔗 DFJustin I think they have to grant permission on an account-by-account basis
22:54 🔗 arrith Nemo_bis: ah right, for other sites yeah. though i'm not sure tpb exposes the info_hash on pages. if it is it's well hidden and google can't see it since i've never gotten a tpb result for googling a userhash
22:55 🔗 DFJustin I tried emailing about it a while back but got a useless form letter
22:55 🔗 Nemo_bis heh
22:55 🔗 Nemo_bis well, I just put tags which make sense
22:56 🔗 Nemo_bis and when there's too much stuff to let around, someone with permissions gets sick of it and moves to the correct collection
22:56 🔗 Nemo_bis not my problem
22:56 🔗 DFJustin like, somehow this guy has the magic bits http://www.archive.org/details/homaled
22:57 🔗 Nemo_bis ...
22:58 🔗 yipdw arrith: unless they expose it some other way, grabbing the HTML of each page is the best you can do
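
If grabbing the HTML is the only route, pulling the hashes out of each page might look roughly like this. It is a sketch under assumptions: that magnet links appear in double-quoted href attributes (the actual TPB markup is not shown in this log), and `magnet_hashes` is a name invented for the example.

```python
import re
from html import unescape
from urllib.parse import parse_qs

# Assumption: magnet links are in double-quoted href attributes.
MAGNET_RE = re.compile(r'href="(magnet:\?[^"]+)"')


def magnet_hashes(html):
    """Return the BitTorrent info hashes of every magnet link in a page."""
    hashes = []
    for raw in MAGNET_RE.findall(html):
        # Undo HTML escaping (&amp;) and keep only the query string.
        query = unescape(raw).split("?", 1)[1]
        for xt in parse_qs(query).get("xt", []):
            if xt.startswith("urn:btih:"):
                hashes.append(xt[len("urn:btih:"):].lower())
    return hashes
```

Storing only the hash (plus the `dn` title if wanted) instead of the whole page is what keeps an archive like the one arrith linked so small.
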
22:58 🔗 DFJustin anyway you can just poke jason when he comes by
22:59 🔗 arrith yipdw: yeah i guess so
22:59 🔗 yipdw I'm not familiar with what services TPB provides
22:59 🔗 yipdw re: magnet URL tracking
22:59 🔗 yipdw unfortunately
22:59 🔗 yipdw so nothing is coming to mind atm
22:59 🔗 yipdw and I don't want to access TPB from work :P
22:59 🔗 arrith yipdw: they used to have a tracker but yeah afaik they currently serve up torrent files and list magnet data
22:59 🔗 arrith yipdw: haha. np
23:01 🔗 emijrp the biggest torrent in that archive is the geocities one
23:01 🔗 emijrp the PATCHED one
23:04 🔗 arrith neat
23:04 🔗 arrith emijrp, balrog: btw seems you both were hit by a netsplit and may want to check the public logs if you're interested in magnet link archive stuff
23:05 🔗 balrog arrith: it wasn't a netsplit here
23:05 🔗 balrog my phone battery gave out due to cold weather :p
23:06 🔗 arrith ouch
23:07 🔗 balrog went from 12% to 0, like that
23:53 🔗 emijrp arrith: http://www.archiveteam.org/index.php?title=The_Pirate_Bay
23:55 🔗 arrith emijrp: ty
23:56 🔗 emijrp time to update
