#archiveteam 2012-02-08,Wed

Time	Nickname	Message
01:16 ^🔗	godane	SketchCow: I'm backing up defcon.org
01:17 ^🔗	godane	its only 3.2gb
01:17 ^🔗	godane	i'm not backing up the top domains of defcon.org like media.defcon.org
04:12 ^🔗	yipdw	heh
04:12 ^🔗	yipdw	7008 yipdw 20 0 380m 81m 11m S 390 0.7 10:37.05 rbx
04:12 ^🔗	yipdw	that's what I like to see, 390% CPU utilization from Ruby
04:12 ^🔗	yipdw	almost all cores in use :D
04:13 ^🔗	chronomex	awwwwyeah
04:32 ^🔗	tef	yipdw: what are you using to archive it ?
04:33 ^🔗	yipdw	tef: archive what?
04:33 ^🔗	tef	ah shit I misread scrollback
04:33 ^🔗	tef	was assuming that was what ruby was up to
04:33 ^🔗	tef	christ how did I get ops?
04:33 ^🔗	yipdw	oh
04:33 ^🔗	yipdw	heh
04:34 ^🔗	yipdw	no, it's crawling a graph
04:35 ^🔗	Coderjoe	149,566,088 bytes before, 215,611,184 after
04:35 ^🔗	Coderjoe	you've nearly doubled the redis memory usage
04:35 ^🔗	yipdw	I stopped writing to it
04:36 ^🔗	Coderjoe	oh
04:36 ^🔗	yipdw	I'm using my home machine right now because it can give me better performance
04:36 ^🔗	yipdw	at some point I'll resync and move the crawler back
04:37 ^🔗	yipdw	and by "better" I mean that I can get on the order of 1k work items/sec processed
04:37 ^🔗	yipdw	the m1.small EC2 instance was doing like 100
04:46 ^🔗	Coderjoe	mmm
04:47 ^🔗	Coderjoe	someone just tried probing my ec2 instance's ssh port
04:51 ^🔗	kennethre	Ah I've missed you all
04:51 ^🔗	tef	hullo
04:51 ^🔗	chronomex	hrm?
04:53 ^🔗	kennethre	tef: so do you think any changes would need to be made to requests?
04:53 ^🔗	tef	well the only thing I can think of is to pass in something to the session, or making a session wrapper in warctools
04:54 ^🔗	tef	personally i'd dump a callback or a method in session to capture raw data, and either pass it in as an argument or subclass it
04:55 ^🔗	tef	so then it's just a def hook(request, response): write_warc(request.raw, response.raw)
04:55 ^🔗	tef	I don't think this is a common use case
04:55 ^🔗	tef	so i'd expect people to have a request dependency from using warctools.requests (say), rather than the other way around
04:56 ^🔗	tef	on the other hand, having a generic hook would be nice for logging/debugging
04:56 ^🔗	tef	a trace handler, really
04:56 ^🔗	tef	I don't know the requests api well enough to make a concrete suggestion
04:57 ^🔗	kennethre	tef: that sounds overly complciated: )
04:57 ^🔗	kennethre	*complicated
04:57 ^🔗	kennethre	keep it simple
04:57 ^🔗	tef	ok
04:58 ^🔗	tef	add a new event hook? that is fired with the request, response pair ?
04:58 ^🔗	kennethre	i don't understand why that's necessary
04:58 ^🔗	kennethre	a warc records a single request right? or a bunch in one file?
04:59 ^🔗	tef	well the thing is to do it we have to pair off the request/response records in the warc
04:59 ^🔗	tef	with the WARC-Concurrent-To: header
04:59 ^🔗	tef	and technically you need to know the request to parse the response
04:59 ^🔗	tef	(because of HEAD requests)
04:59 ^🔗	tef	but you can ignore it for nearly everything because they're rare in practice
05:00 ^🔗	kennethre	what does WARC-Concurrent-To do?
05:00 ^🔗	tef	but we need to know which request goes with which response
05:00 ^🔗	kennethre	i still don't see how this is an issue :)
05:00 ^🔗	tef	http://secretvolcanobase.org/~tef/boingboing_sopa.warc
05:00 ^🔗	tef	if you look at the warc record
05:01 ^🔗	tef	I write the id of the request record in the response concurrent-to header
05:01 ^🔗	tef	and vice versa
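The cross-referencing tef describes can be sketched roughly as below. This is only an illustration of the idea, not warctools code: `new_record_id` and `paired_headers` are hypothetical helper names, and the use of `urn:uuid` identifiers is an assumption about how the record ids are minted.

```python
import uuid


def new_record_id():
    # Hypothetical helper: a record id written as a URI in angle brackets.
    return "<urn:uuid:%s>" % uuid.uuid4()


def paired_headers(target_uri):
    """Return (request_headers, response_headers) that point at each other."""
    request_id = new_record_id()
    response_id = new_record_id()
    request_headers = {
        "WARC-Type": "request",
        "WARC-Record-ID": request_id,
        "WARC-Target-URI": target_uri,
        # The request record names the response it belongs with ...
        "WARC-Concurrent-To": response_id,
    }
    response_headers = {
        "WARC-Type": "response",
        "WARC-Record-ID": response_id,
        "WARC-Target-URI": target_uri,
        # ... and vice versa, so interleaved records can still be matched.
        "WARC-Concurrent-To": request_id,
    }
    return request_headers, response_headers
```

With ids linked this way, the two records do not need to sit back to back in the file.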
05:01 ^🔗	kennethre	yes, i see a request and response
05:01 ^🔗	tef	so with the current event hooks, how would I match them up ?
05:02 ^🔗	kennethre	i still don't see how this is special
05:02 ^🔗	kennethre	i'm just not seeing something obvious or something :)
05:05 ^🔗	kennethre	you make a request, you get a response
05:05 ^🔗	kennethre	there's no disconnect
05:05 ^🔗	Coderjoe	without it, the request and response to that request would have to be back to back. with it, you can interleave
05:05 ^🔗	Coderjoe	if more than one request were to be made at the same time (or before the previous finished)
05:06 ^🔗	kennethre	you mean http/1.1 pipelining or app-level connection pooling?
05:08 ^🔗	Coderjoe	for example, a web browser is generally permitted to open two tcp connections to the server at the same time
05:08 ^🔗	Coderjoe	and issue requests concurrently on them
05:08 ^🔗	kennethre	yeah
05:08 ^🔗	kennethre	you could open 5000 with requests
05:09 ^🔗	kennethre	you don't not have a reference to the request though
05:09 ^🔗	kennethre	it's a non-issue
05:09 ^🔗	kennethre	responses have a request attribute
05:09 ^🔗	tef	argh wireless
05:09 ^🔗	kennethre	even if it didn't though, it wouldn't be an issue
05:12 ^🔗	tef	kennethre: no I mean, if you passed in two event hooks, one gets the request, one gets the response
05:12 ^🔗	tef	if you had multiple things or threads calling on the same session how do I know they match up
05:12 ^🔗	tef	ah shit
05:12 ^🔗	tef	if you had multiple things or threads calling on the same session how do I know they match up
05:13 ^🔗	tef	kennethre: oh doh
05:13 ^🔗	tef	I see
05:13 ^🔗	kennethre	this is super simple
05:13 ^🔗	kennethre	sessions are thread safe
05:13 ^🔗	tef	05:09 < kennethre> responses have a request attribute
05:13 ^🔗	kennethre	even if they didn't
05:13 ^🔗	kennethre	it wouldn't be an issue :)
05:13 ^🔗	tef	yes
05:14 ^🔗	tef	that was the bit I was missing
05:14 ^🔗	tef	but hey, it's 5am :-)
05:14 ^🔗	kennethre	haha
05:14 ^🔗	tef	although I did wake up at midday....
05:14 ^🔗	kennethre	how's this
05:14 ^🔗	kennethre	i feel like i could write this thing in 2 hours
05:14 ^🔗	tef	so yeah I just write a nice hook
05:14 ^🔗	kennethre	no hook
05:14 ^🔗	kennethre	wrap
05:14 ^🔗	kennethre	out of band :)
05:14 ^🔗	kennethre	no need to mess with internals
05:15 ^🔗	tef	well I was figuring a wrapper that passed in a hook :v but I guess wrapping it also works
05:15 ^🔗	tef	and is probably simpler
05:15 ^🔗	tef	I think I got labeled the architect because I over engineer things, or that I'm terrible at testing. It's a job title of shame.
05:15 ^🔗	kennethre	you make a request, record what you need, get the response, record what you need
05:16 ^🔗	kennethre	put them in a giant pool if you need, out comes a warc
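A minimal sketch of the "wrap, out of band" approach kennethre describes, assuming only what the log states: that a response object exposes the request that produced it as a `request` attribute. `RecordingSession` is a hypothetical name, and the wrapped object is assumed to offer a `request(method, url, **kwargs)` method, as a requests session does.

```python
class RecordingSession:
    """Wrap any object that has a request(method, url, **kwargs) method."""

    def __init__(self, session):
        self._session = session
        self.pairs = []  # the "giant pool" of (request, response) pairs

    def request(self, method, url, **kwargs):
        response = self._session.request(method, url, **kwargs)
        # The response carries its own request, so each pair stays matched
        # no matter how many callers share the wrapped session.
        self.pairs.append((response.request, response))
        return response

    def get(self, url, **kwargs):
        return self.request("GET", url, **kwargs)
```

Writing the collected pairs out as WARC records is then a separate step that never touches the HTTP library's internals.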
05:16 ^🔗	kennethre	i can even add this into requests itself
05:16 ^🔗	kennethre	i've thought about it
05:16 ^🔗	kennethre	response.warc
05:16 ^🔗	kennethre	tef: hehe, I hate overengineering :)
05:16 ^🔗	tef	kennethre: doesn't everyone :-)
05:16 ^🔗	kennethre	it's my mission in life to abolish it
05:16 ^🔗	kennethre	esp in python :)
05:17 ^🔗	tef	well it's my mission in life to avoid it, it's effort
05:17 ^🔗	kennethre	so a warc file could have 300 requests in it, right?
05:17 ^🔗	tef	well 300 request/responses
05:17 ^🔗	kennethre	(obviously)
05:18 ^🔗	tef	technically a warc file starts with a warcinfo record with an anvl separated (read key:value\r\n iirc) format
05:18 ^🔗	kennethre	and if a server responds with a transfer encoding
05:18 ^🔗	kennethre	like gzip, does it ungzip it first?
05:18 ^🔗	tef	well, there are two things you can do
05:18 ^🔗	tef	you can just ungzip it and unchunk it (we do that because other software is shitty at parsing)
05:18 ^🔗	tef	or you can write a conversion record that is concurrent to the request/response record
05:18 ^🔗	tef	which can contain the original or the converted one .
05:19 ^🔗	tef	i'd probably have to doublecheck on that restriction but
05:19 ^🔗	kennethre	i'll have to look at the spec
05:19 ^🔗	tef	I think archive.org hates gzip encoding
05:19 ^🔗	kennethre	but i feel like i could do this pretty easily
05:19 ^🔗	tef	kennethre: it's an iso standard, but 1.00 is basically 0.17
05:19 ^🔗	kennethre	even add some nice gevent magic
05:19 ^🔗	tef	yeah the thing is I realized it was easy but I figured i'd ask
05:20 ^🔗	kennethre	i guess it needs to crawl too, though?
05:20 ^🔗	tef	well I have some code to do basic crawling somewhere
05:20 ^🔗	tef	I wrote it in an afternoon as a code sample so it's terrible
05:20 ^🔗	kennethre	tef: i'd love to collaborate on this if you're in the mood for such a thing ;)
05:20 ^🔗	tef	sure
05:20 ^🔗	kennethre	tef: I've been meaning to get more involved with archiveteam
05:21 ^🔗	tef	I was gonna just hack this shit up in warctools and push it and hope someone here finds it useful to augment the warc-get bulldozers
05:21 ^🔗	kennethre	I'm in :)
05:21 ^🔗	tef	thing is, writing warcs is relatively straight forward
05:21 ^🔗	kennethre	yeah
05:21 ^🔗	tef	you could probably do it yourself pretty easily
05:21 ^🔗	kennethre	yeah :)
05:22 ^🔗	tef	it's basically VERSION CR LF HEADERS* CRLF BODY CRLF
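The layout tef gives can be sketched as a small serializer. This is an illustrative sketch, not the warctools writer: `serialize_record` is a hypothetical name, the `Content-Length` header is added on the assumption that the body length has to be declared, and the record is ended with two CRLFs (one more than the shorthand above), which is how I understand the published format.

```python
CRLF = b"\r\n"


def serialize_record(headers, body, version=b"WARC/1.0"):
    """Serialize one record: version line, headers, blank line, body."""
    lines = [version]
    for name, value in headers:
        lines.append(("%s: %s" % (name, value)).encode("utf-8"))
    # Content-Length counts the bytes of the body, not characters.
    lines.append(b"Content-Length: " + str(len(body)).encode("ascii"))
    head = CRLF.join(lines) + CRLF
    # A blank line separates headers from the body; the record is then
    # terminated with two CRLFs (assumption, see the note above).
    return head + CRLF + body + CRLF + CRLF
```

The body is passed as bytes and written untouched, which matches treating the HTTP message as an octet stream.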
05:22 ^🔗	kennethre	if there's no aversion to dependencies, we could make it crazy awesome w/ gevent
05:22 ^🔗	tef	I have to do error correcting parsing on a bunch of weird formats
05:22 ^🔗	kennethre	i'll go to 11 :)
05:22 ^🔗	tef	well I have no problem with them
05:23 ^🔗	tef	but http://code.hanzoarchives.com/warc-tools/src/58d7d99406b0/hanzo/warctools/warc.py#cl-51 (btw now MIT) is how we write them
05:23 ^🔗	tef	the boingboing warc is pretty much all you need beyond a warcinfo record
05:24 ^🔗	tef	if you push something to git I can flesh out the warcwriting to make it standards compliant
05:24 ^🔗	tef	they also like to be gzipped, record by record and then catted together
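"Gzipped record by record and then catted together" can be sketched with the standard library alone. `gzip_records` is a hypothetical name and the records are assumed to be already serialized bytes.

```python
import gzip


def gzip_records(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)
```

Because every record is a separate gzip member, a reader that knows a record's byte offset can decompress just that record, while decompressing the whole file still yields all records in order.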
05:25 ^🔗	tef	my github is tef fwiw
05:26 ^🔗	tef	cos I am totally in a mood to code before bedtime, and we're at the edge of a release cycle so I have to not break work code today
05:26 ^🔗	kennethre	awesome
05:26 ^🔗	kennethre	I'm a member of the archiveteam org
05:26 ^🔗	kennethre	might push something up there
05:26 ^🔗	kennethre	we'll see
05:26 ^🔗	kennethre	depends on my mood
05:26 ^🔗	kennethre	i have 800 projects going on right now
05:26 ^🔗	tef	heh
05:26 ^🔗	kennethre	this one sounds nice and quick though, so i might kick it out ;)
05:27 ^🔗	tef	well I am interested in making requests + warc output happen
05:27 ^🔗	kennethre	are warcs binary?
05:27 ^🔗	kennethre	utf8? any encoding
05:27 ^🔗	tef	well, technically the body is in binary
05:27 ^🔗	tef	the headers can be utf-8
05:27 ^🔗	kennethre	headers are supposed to be latin1
05:27 ^🔗	tef	but most of the values tend to be ascii anyway
05:27 ^🔗	kennethre	grr
05:27 ^🔗	tef	the headers of a warc file can be utf-8
05:27 ^🔗	kennethre	but the file itself
05:28 ^🔗	kennethre	is binary
05:28 ^🔗	tef	the http message is treated as an octet-stream
05:28 ^🔗	tef	yes
05:28 ^🔗	kennethre	with utf8 splashed in
05:28 ^🔗	kennethre	gotcha
05:28 ^🔗	kennethre	perfect
05:28 ^🔗	tef	somewhat but you don't see it in practice
05:28 ^🔗	tef	technically headers can also be mime/quoted printable but I have never seen it
05:29 ^🔗	tef	so yeah - warc headers/values are nominally utf-8 but it's best to write ascii - so % encode the url in Target-URI - and the http message is treated as bytes
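A rough sketch of the "% encode the url" advice, using only the standard library. `ascii_target_uri` is a hypothetical name, and the set of characters left unescaped is my own choice for illustration: reserved URI characters and existing `%XX` escapes are kept, everything else is escaped as UTF-8 bytes.

```python
from urllib.parse import quote


def ascii_target_uri(url):
    """Return an ASCII-only form of url suitable for a header value."""
    # Keep URI delimiters and '%' as they are so already-escaped URLs
    # pass through unchanged; escape spaces and non-ASCII characters.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%~")
```

Non-ASCII host names are not handled specially here; they would need IDNA encoding rather than percent-encoding.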
05:30 ^🔗	kennethre	excellent
05:30 ^🔗	tef	and the newline for warc-records is CRLF
05:30 ^🔗	kennethre	sounds quite straightforward
05:31 ^🔗	tef	I have read the iso standard
05:31 ^🔗	tef	well parsing it is more annoying than writing nice ones :-)
05:31 ^🔗	kennethre	there should be a validator :)
05:31 ^🔗	tef	there is warctools
05:31 ^🔗	kennethre	i'm surprised you didn't just use HAL i think?
05:31 ^🔗	kennethre	what is it
05:31 ^🔗	tef	on pypi which has a warc2warc.py and a warcvalid.py
05:31 ^🔗	kennethre	the one chrome uses
05:31 ^🔗	tef	the chrome one has no headers
05:31 ^🔗	kennethre	HAR
05:31 ^🔗	tef	or didn't
05:31 ^🔗	kennethre	wtf really?
05:32 ^🔗	tef	this is the one heritrix produces, and people are using in practice in archives
05:32 ^🔗	kennethre	it does
05:32 ^🔗	tef	(warcs)
05:32 ^🔗	kennethre	but it's ugly
05:32 ^🔗	kennethre	yeah
05:32 ^🔗	kennethre	oh i know it's standard now
05:32 ^🔗	tef	warcs have a bunch of stuff for archivists (metadata, conversions, capture information)
05:33 ^🔗	kennethre	looks like HAR doesn't have the body!?
05:33 ^🔗	tef	there was one I saw which was essentially a http message in json to avoid having to parse http
05:33 ^🔗	kennethre	yes
05:33 ^🔗	kennethre	that's the one
05:34 ^🔗	tef	thing is, when we're doing the compliance thing we sort of tout the whole 'you can see what went over the wire' thing
05:34 ^🔗	tef	but really I didn't see the point much (also json unicode heh)
05:34 ^🔗	tef	sometimes the interesting bits of http messages is the raw encoding and how it's broken
05:35 ^🔗	kennethre	yet you don't store it transfer-encoded :)
05:35 ^🔗	tef	ssssh
05:35 ^🔗	tef	well we could but technically an upstream proxy is allowed to perform that transformation
05:35 ^🔗	tef	and if it breaks we leave the original in
05:36 ^🔗	kennethre	ah
05:36 ^🔗	kennethre	fair enough
05:36 ^🔗	kennethre	yep, that's what requests does too
05:36 ^🔗	kennethre	awesome
05:36 ^🔗	kennethre	this is going to be great
05:37 ^🔗	tef	well if only there was a collaborative editor for python online I would be so happy
05:37 ^🔗	tef	actually github does let me edit things in situ on the repo I just recalled
05:37 ^🔗	tef	anyway, I am wondering what I can start writing
05:37 ^🔗	tef	which might be of any help
05:38 ^🔗	tef	but mostly it seems I am being an oracle for the iso warc standard
05:38 ^🔗	kennethre	a feature wish list ;)
05:38 ^🔗	tef	ok
05:44 ^🔗	tef	https://notes.typo3.org/p/K2To4zZyGy
05:45 ^🔗	tef	doing that with emphasis on correct iso output
05:45 ^🔗	tef	ugh sometimes being a standards weenie is useful I guess
05:48 ^🔗	kennethre	hey, requests still doesn't do a POST/GET redirect
05:48 ^🔗	kennethre	rfc till death!
05:52 ^🔗	tef	also utc, utf-8 or death
05:53 ^🔗	kennethre	i like this guy
05:53 ^🔗	tef	I have a friend in localization/internationalization
05:53 ^🔗	tef	we have similar chants
05:53 ^🔗	tef	including iso date times or death
05:54 ^🔗	tef	unfortunately reality has a nasty way of making us deal with encoding issues
06:07 ^🔗	tef	anyway, that should be enough to go on kennethre ?
06:13 ^🔗	tef	https://notes.typo3.org/p/K2To4zZyGy
06:15 ^🔗	kennethre	tef: excellent, that should be perfect, thanks man :)
06:15 ^🔗	kennethre	tef: I shall commence :)
06:16 ^🔗	tef	cool
06:17 ^🔗	tef	I was this close to trying to add it myself
06:17 ^🔗	tef	well i've forked requests and checked it out
06:17 ^🔗	tef	hmm
06:18 ^🔗	tef	I can write something to do link extraction though
06:26 ^🔗	tef	holy shit
06:26 ^🔗	tef	so i've hacked up a crawler to use requests
06:26 ^🔗	tef	three line change
06:27 ^🔗	tef	https://github.com/tef/codesamples/tree/master/pyget
06:30 ^🔗	tef	https://github.com/tef/crawler even (now)
06:33 ^🔗	kennethre	haha excellent
06:33 ^🔗	kennethre	well that's simple enough
06:35 ^🔗	tef	yeah now it just needs warc output to be hacked in
06:35 ^🔗	tef	thing is, as much as it seems to be duplicating warc-wget I kind of appreciate it being in python so it's configurable
06:36 ^🔗	tef	now you just need to make it report to a generic tracker and upload directly to archive.org s3 :v
06:36 ^🔗	tef	s/you/me/etc/
06:36 ^🔗	tef	btw: it made it faster, scarily faster
06:37 ^🔗	kennethre	hehe
06:37 ^🔗	kennethre	yeah there's no reason it shouldn't be in python really
06:38 ^🔗	kennethre	compiling warc-wget was far harder than installing a python dep
06:38 ^🔗	kennethre	and we can script that all out
06:39 ^🔗	tef	yeah
06:39 ^🔗	tef	well shall i add you to crawler ?
06:40 ^🔗	tef	it's not pretty but it is enough to consume requests and write warcs
06:41 ^🔗	tef	i'm just tidying up the html extractor to make it more wget-like
06:43 ^🔗	kennethre	awesome
06:43 ^🔗	kennethre	nah i'll make my thing
06:43 ^🔗	kennethre	and you can use it in your crawler
06:46 ^🔗	tef	awesome
06:47 ^🔗	tef	thing is I could always pull in another dep on warctools, unpack the raw bits from requests and dump them into a warc
06:48 ^🔗	tef	I don't mind either way
06:48 ^🔗	tef	although the way that involves less effort for me is somewhat preferable :v
07:11 ^🔗	tef	now I want to rewrite it entirely
07:58 ^🔗	tef	kennethre: fwiw I think it might be easier for me to just write warcs
07:58 ^🔗	tef	you've shown me the light with requests
08:21 ^🔗	Coderjoe	ugh
08:21 ^🔗	Coderjoe	h264 does not belong in avi
08:23 ^🔗	Coderjoe	(just ran across a movie on IA where the "Cinepack" file (which is their silly way of saying "avi", i guess) is h264+mp3 in avi)
08:24 ^🔗	tef	kennethre: one minor issue. there is no request.raw :3
08:49 ^🔗	tef	well i've got it making pseudo warcs
08:50 ^🔗	tef	https://github.com/tef/crawler/tree/master/scraper
08:51 ^🔗	tef	but it has to reconstruct the http message from the request/response objects
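Reconstructing an HTTP request message from its parts might look roughly like the sketch below. This is not the code in tef's scraper: `rebuild_request` is a hypothetical name, the request line is assumed to be HTTP/1.1, and a `Host` header is added from the URL when the caller has not supplied one.

```python
from urllib.parse import urlsplit


def rebuild_request(method, url, headers, body=b""):
    """Rebuild an approximate raw HTTP request from method, URL and headers."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = ["%s %s HTTP/1.1" % (method, path)]
    # Add Host from the URL only if the caller did not pass one.
    if not any(name.lower() == "host" for name, _ in headers):
        lines.append("Host: %s" % parts.netloc)
    lines.extend("%s: %s" % (name, value) for name, value in headers)
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("latin-1") + body
```

A message rebuilt this way is only an approximation: header order, casing and anything the HTTP library added on its own may differ from what actually went over the wire.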
09:56 ^🔗	dcmorton	hey all, not sure if anybody is interested in this, but here's some stats on the VM that i've currently got downloading MobileMe.. http://networkwhisperer.com/cacti/
09:56 ^🔗	dcmorton	been averaging about 250 gigs up/down a day
10:30 ^🔗	Coderjoe	uhoh
10:30 ^🔗	Coderjoe	http://www.theinquirer.net/inquirer/news/2144705/yahoo-sheds-chairman-directors
10:31 ^🔗	Coderjoe	another big change happening at the top over at yahoo
10:32 ^🔗	ersi	I'm axin' up axin' up, axin' up
10:32 ^🔗	ersi	'cause my shareholders taught me good
10:59 ^🔗	chronomex	lubin' up the axe
11:07 ^🔗	Coderjoe	hide your flickr, hide your ... is anything left of delicious?
13:25 ^🔗	godane	SketchCow: http://good.net/dl/bd
13:26 ^🔗	godane	this has a lot of videos from all the hacker cons
13:26 ^🔗	godane	and your bbs documentary
13:27 ^🔗	godane	it also has the convention.cdroms sections
13:28 ^🔗	godane	Also i was wondering
13:30 ^🔗	godane	does archive.org do some sort of deduplication?
13:31 ^🔗	godane	i just think it has to since people may upload the same video more than once
13:33 ^🔗	ersi	I bet they do deduplication and replication
13:33 ^🔗	ersi	I mean, considering the amounts of data they have/get
15:42 ^🔗	Schbirid	SketchCow: is David R. Foley on your list for the arcade docu?
15:43 ^🔗	Schbirid	emijrp: let's compare http://enjoys.it/jamendo/jamendo-archive-tcs_20120126.txt (some mb text)
16:37 ^🔗	Schbirid	shite, any one got an idea what happened to klov.com?
16:38 ^🔗	Schbirid	http://www.arcade-museum.com/ is dead for me
17:30 ^🔗	yipdw	ahahaha
17:30 ^🔗	yipdw	http://www.nytimes.com/2012/02/08/opinion/what-wikipedia-wont-tell-you.html?_r=1
17:31 ^🔗	yipdw	it's cute how Cary Sherman assumes the US owns the Internet
17:31 ^🔗	yipdw	wait, no, it's not cute
17:31 ^🔗	yipdw	it's dangerously misguided
17:46 ^🔗	emijrp	US owns the Internet.
17:50 ^🔗	emijrp	US = Archive Team. Of course.
22:13 ^🔗	arrith	Magnet-hashes for all torrents on The Pirate Bay: 164 MB (thepiratebay.se)
22:13 ^🔗	arrith	http://news.ycombinator.com/item?id=3568393
22:13 ^🔗	arrith	https://thepiratebay.se/torrent/7016365/The_whole_Pirate_Bay_magnet_archive
22:13 ^🔗	arrith	magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80
22:13 ^🔗	arrith	from the description: "The only thing that's strange is that I found out only about 1.5 millions of torrents, while there is something about 4 millions of torrents in TPB footer. However, I think I am correct and TPB footer is not ;)"
22:14 ^🔗	arrith	not knowing his collection method of the magnet links it's hard to figure out where/if he went wrong without doing my own
22:15 ^🔗	arrith	actual content of the torrent is compressed and is only 90 MiB btw
22:33 ^🔗	Nemo_bis	arrith, I've also set up a cronjob for this: http://www.archive.org/details/publicbt.com
22:33 ^🔗	Nemo_bis	and that's 3 millions hashes
22:34 ^🔗	Nemo_bis	but doesn't contain titles
22:34 ^🔗	arrith	Nemo_bis: ah that's pretty fancy
22:35 ^🔗	Nemo_bis	well, nothing special
22:35 ^🔗	arrith	Nemo_bis: have you checked if any hashes get removed from newer grabs?
22:35 ^🔗	Nemo_bis	torrent sites are a quite stupid thing after all
22:35 ^🔗	Nemo_bis	arrith, obviously not!
22:35 ^🔗	Nemo_bis	that's something I want others to do
22:35 ^🔗	arrith	ah
22:35 ^🔗	Nemo_bis	otherwise I wouldn't upload it to IA :-p
22:36 ^🔗	arrith	heh
22:36 ^🔗	Nemo_bis	in a few years, researchers will have a lot of data to work on in that item :D
22:36 ^🔗	arrith	indeed
22:36 ^🔗	arrith	Nemo_bis: has what you have already grabbed been posted on IA?
22:36 ^🔗	Nemo_bis	arrith, what do you mean?
22:37 ^🔗	arrith	Nemo_bis: i got the impression you were or were planning to upload it to IA
22:37 ^🔗	Nemo_bis	also, you should upload that TPB archive to IA
22:37 ^🔗	Nemo_bis	arrith, I've already uploaded everything
22:38 ^🔗	Nemo_bis	it's just the list of hashes of PBt as published on their website
22:38 ^🔗	arrith	Nemo_bis: ah. IA makes stuff public eventually right? has to be curated or something first though i guess
22:38 ^🔗	Nemo_bis	it's already public, isn't it?
22:38 ^🔗	Nemo_bis	unless some sysadmin has "censored" it
22:39 ^🔗	Nemo_bis	there's nothing to hide there, it's just a list of hashes
22:39 ^🔗	arrith	the uploaded IA stuff or publicbt's archives?
22:39 ^🔗	Nemo_bis	doesn't mean anything in itself
22:39 ^🔗	Nemo_bis	that item
22:39 ^🔗	Nemo_bis	(as they explain in their home page)
22:39 ^🔗	arrith	since i'm pretty sure publicbt's stuff changes, so one would need to get backdated version
22:39 ^🔗	arrith	ah
22:40 ^🔗	arrith	Nemo_bis: well i'm not getting anything for http://www.archive.org/search.php?query=all.txt.bz2
22:40 ^🔗	Nemo_bis	you could use those hashes to get all torrents or all info about what they actually contain, but that's not something I'm going to do :D
22:40 ^🔗	Nemo_bis	I do only the easy stuff
22:41 ^🔗	arrith	right
22:41 ^🔗	Nemo_bis	arrith, I doubt you can search filenames
22:41 ^🔗	arrith	Nemo_bis: any idea how one might find them?
22:41 ^🔗	Nemo_bis	arrith, find what?
22:41 ^🔗	Nemo_bis	brb
22:43 ^🔗	DFJustin	http://ia700807.us.archive.org/11/items/publicbt.com/
22:44 ^🔗	arrith	nice
22:44 ^🔗	arrith	ty DFJustin
22:44 ^🔗	DFJustin	Nemo_bis: set the item mediatype to "data" or "web" so it shows the file links
22:45 ^🔗	Nemo_bis	DFJustin, am I allowed to?
22:45 ^🔗	Nemo_bis	isn't that for privileged users
22:46 ^🔗	DFJustin	the permissions are actually available for anyone, just the web interface doesn't let you
22:47 ^🔗	DFJustin	you can using s3 http://www.archive.org/help/abouts3.txt
22:47 ^🔗	Nemo_bis	hm
22:47 ^🔗	DFJustin	or I just use firebug to take the readonly attribute off the textbox
22:47 ^🔗	DFJustin	:D
22:47 ^🔗	Nemo_bis	last time I tried, I failed
22:47 ^🔗	Nemo_bis	perhaps because I tried to upload to some other collection
22:47 ^🔗	DFJustin	yeah the collection stuff is locked down
22:47 ^🔗	Nemo_bis	do you mean, change mediatype and leave in that collection
22:47 ^🔗	Nemo_bis	ah ok
22:49 ^🔗	Nemo_bis	DFJustin, do I have to respecify all metadata?
22:49 ^🔗	DFJustin	dunno it kind of sounds like it but I haven't tried on an existing item
22:49 ^🔗	arrith	heh have to tweak their page to get stuff to display. i wonder why they have it set locked down like that
22:50 ^🔗	Nemo_bis	which page?
22:50 ^🔗	DFJustin	for the text mediatype it makes sense since it shows the page reader interface and stuff, but if you upload something that's not a book, it doesn't convert and thus you see nothing
22:51 ^🔗	Nemo_bis	DFJustin, I never manage to follow the example in the doc
22:52 ^🔗	Nemo_bis	I get an HTML page with "A request of the requested method PUT requires a valid Content-length."
22:52 ^🔗	arrith	yipdw: when you're around: can you think of any better way to get all magnet links on a torrent site, say the pirate bay, without basically grabbing the full html of each page?
22:52 ^🔗	arrith	also anyone else with thoughts on that ^
22:53 ^🔗	Nemo_bis	info_hash is often exposed on the page to let search engines index it
22:53 ^🔗	DFJustin	the collection stuff puzzles me though, they have stuff like http://www.archive.org/details/open_source_software that they don't actually allow the unwashed to upload to (although the gatekeeping is sufficiently lax to allow random metadata-less arabic stuff)
22:53 ^🔗	Nemo_bis	who can upload there?
22:54 ^🔗	DFJustin	I think they have to grant permission on an account-by-account basis
22:54 ^🔗	arrith	Nemo_bis: ah right, for other sites yeah. though i'm not sure tpb exposes the info_hash on pages. if it is it's well hidden and google can't see it since i've never gotten a tpb result for googling a userhash
22:55 ^🔗	DFJustin	I tried emailing about it a while back but got a useless form letter
22:55 ^🔗	Nemo_bis	heh
22:55 ^🔗	Nemo_bis	well, I just put tags which make sense
22:56 ^🔗	Nemo_bis	and when there's too much stuff to let around, someone with permissions gets sick of it and moves to the correct collection
22:56 ^🔗	Nemo_bis	not my problem
22:56 ^🔗	DFJustin	like, somehow this guy has the magic bits http://www.archive.org/details/homaled
22:57 ^🔗	Nemo_bis	...
22:58 ^🔗	yipdw	arrith: unless they expose it some other way, grabbing the HTML of each page is the best you can do
22:58 ^🔗	DFJustin	anyway you can just poke jason when he comes by
22:59 ^🔗	arrith	yipdw: yeah i guess so
22:59 ^🔗	yipdw	I'm not familiar with what services TPB provides
22:59 ^🔗	yipdw	re: magnet URL tracking
22:59 ^🔗	yipdw	unfortunately
22:59 ^🔗	yipdw	so nothing is coming to mind atm
22:59 ^🔗	yipdw	and I don't want to access TPB from work :P
22:59 ^🔗	arrith	yipdw: they used to have a tracker but yeah afaik they currently serve up torrent files and list magnet data
22:59 ^🔗	arrith	yipdw: haha. np
23:01 ^🔗	emijrp	the biggest torrent in that archive is the geocities one
23:01 ^🔗	emijrp	the PATCHED one
23:04 ^🔗	arrith	neat
23:04 ^🔗	arrith	emijrp, balrog: btw seems you both were hit by a netsplit and may want to check the public logs if you're interested in magnet link archive stuff
23:05 ^🔗	balrog	arrith: it wasn't a netsplit here
23:05 ^🔗	balrog	my phone battery gave out due to cold weather :p
23:06 ^🔗	arrith	ouch
23:07 ^🔗	balrog	went from 12% to 0, like that
23:53 ^🔗	emijrp	arrith: http://www.archiveteam.org/index.php?title=The_Pirate_Bay
23:55 ^🔗	arrith	emijrp: ty
23:56 ^🔗	emijrp	time to update