[01:16] SketchCow: I'm backing up defcon.org [01:17] its only 3.2gb [01:17] i'm not backing up the top domains of defcon.org like media.defcon.org [04:12] heh [04:12] 7008 yipdw 20 0 380m 81m 11m S 390 0.7 10:37.05 rbx [04:12] that's what I like to see, 390% CPU utilization from Ruby [04:12] almost all cores in use :D [04:13] awwwwyeah [04:32] yipdw: what are you using to archive it ? [04:33] tef: archive what? [04:33] ah shit I misread scrollback [04:33] was assuming that was what ruby was up to [04:33] christ how did I get ops? [04:33] oh [04:33] heh [04:34] no, it's crawling a graph [04:35] 149,566,088 bytes before, 215,611,184 after [04:35] you've nearly doubled the redis memory usage [04:35] I stopped writing to it [04:36] oh [04:36] I'm using my home machine right now because it can give me better performance [04:36] at some point I'll resync and move the crawler back [04:37] and by "better" I mean that I can get on the order of 1k work items/sec processed [04:37] the m1.small EC2 instance was doing like 100 [04:46] mmm [04:47] someone just tried probing my ec2 instance's ssh port [04:51] Ah I've missed you all [04:51] hullo [04:51] hrm? [04:53] tef: so do you think any changes would need to be made to requests? [04:53] well the only thing I can think of is to pass in something to the session, or making a session wrapper in warctools [04:54] personally i'd dump a callback or a method in session to capture raw data, and either pass it in as an argument or subclass it [04:55] so then it's just a def hook(request, response): write_warc(request.raw, response.raw) [04:55] I don't think this is a common use case [04:55] so i'd expect people to have a request dependency from using warctools.requests (say), rather than the other way around [04:56] on the other hand, having a generic hook would be nice for logging/debugging [04:56] a trace handler, really [04:56] I don't know the requests api too well enough to make a concrete suggestion [04:57] tef: that sounds overly complciated: ) [04:57] *complicated [04:57] keep it simple [04:57] ok [04:58] add a new event hook? that is fired with the request, response pair ? [04:58] i don't understand why that's neccesary [04:58] a warc records a single request right? or a bunch in one file? [04:59] well the thing is to do it we have to pair off the trequest/response records in the warc [04:59] with the WARC-Concurrent-To: header [04:59] and *technically* you need to know the request to parse the response [04:59] (because of HEAD requests) [04:59] but you can ignore it for nearly everything because they're rare in practice [05:00] what does WARC-Concurrent-To do? [05:00] but we need to know which request goes with which response [05:00] i still don't see how this is an issue :) [05:00] http://secretvolcanobase.org/~tef/boingboing_sopa.warc [05:00] if you look at the warc record [05:01] I write the id of the request record in the response concurrent-to header [05:01] and vice versa [05:01] yes, i see a request and response [05:01] so with the current event hooks, how would I match them up ? [05:02] i still don't see how this is special [05:02] i'm just not seeing something obvious or something :) [05:05] you make a request, you get a response [05:05] there's no disconnect [05:05] without it, the request and response to that request would have to be back to back. 
with it, you can interleave [05:05] if more than one request were to be made at the same time (or before the previous finished) [05:06] you mean http/1.1 pipelining or app-level connection pooling? [05:08] for example, a web browser is generally permitted to open two tcp connections to the server at the same time [05:08] and issue requests concurrently on them [05:08] yeah [05:08] you could open 5000 with requests [05:09] you don't not have a reference to the request though [05:09] it's a non-issue [05:09] responses have a request attribute [05:09] argh wireless [05:09] even if it didn't though, it wouldn't be an issue [05:12] kennethre: no I mean, if you passed in two event hooks, one gets the request, one gets the response [05:12] if you had multiple things or threads calling on the same session how do I know they match up [05:12] ah shit [05:12] if you had multiple things or threads calling on the same session how do I know they match up [05:13] kennethre: oh doh [05:13] I see [05:13] this is super simple [05:13] sessions are thread safe [05:13] 05:09 < kennethre> responses have a request attribute [05:13] even if they didn't [05:13] it wouldn't be an issue :) [05:13] yes [05:14] that was the bit I was missing [05:14] but hey, it's 5am :-) [05:14] haha [05:14] although I did wake up at midday.... [05:14] how's this [05:14] i feel like i could write this thing in 2 hours [05:14] so yeah I just write a nice hook [05:14] no hook [05:14] wrap [05:14] out of band :) [05:14] no need to mess with internals [05:15] well I was figuring a wrapper that passed in a hook :v but I guess wrapping it also works [05:15] and is probably simpler [05:15] I think I got labeled the architect because I over engineer things, or that I'm terrible at testing. It's a job title of shame. [05:15] you make a request, record what you need, get the response, record what you need [05:16] put them in a giant pool if you need, out comes a warc [05:16] i can even add this into requests itself [05:16] i've throught about it [05:16] response.warc [05:16] tef: hehe, I *hate* overengineering :) [05:16] kennethre: doesn't everyone :-) [05:16] it's my mission in life to abolish it [05:16] esp in python :) [05:17] well it's my mission in life to avoid it, it's effort [05:17] so a warc file could have 300 requests in it, right? [05:17] well 300 request/responses [05:17] (obviously) [05:18] technically a warcrecord starts with a warcinfo record with an anvl seperated (read key:value\r\n iirc) format [05:18] and if a server responds with a transfer encoding [05:18] like gzip, does it ungzip it first? [05:18] well, there are *two things* you can do [05:18] you can just ungzip it and unchunk it (we do that because other software is shitty at parsing) [05:18] or you can write a conversion record that is concurrent to the request/response record [05:18] which can contain the original or the converted one . [05:19] i'd probably have to doublecheck on that restriction but [05:19] i'll have to look at the spec [05:19] I think archive.org hates gzip encoding [05:19] but i feel like i could do this pretty easily [05:19] kennethre: it's an iso standard, but 1.00 is basically 0.17 [05:19] even add some nice gevent magic [05:19] yeah the thing is I realized it was easy but I figured i'd ask [05:20] i guess it needs to crawl too, though? 
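
A rough sketch of the pairing kennethre and tef settle on above: because a requests Response keeps a reference to the request that produced it (response.request), the request and response records can be cross-linked with WARC-Concurrent-To after the call returns, no event hook needed. The helper below is hypothetical (the name and the urn:uuid record IDs are only illustrative) and leaves serialization for later:

    # Hypothetical helper: build cross-linked WARC header sets for one
    # request/response pair, writing each record's id into the other's
    # WARC-Concurrent-To field, as described in the discussion above.
    import uuid
    from datetime import datetime, timezone

    def record_id():
        return "<urn:uuid:%s>" % uuid.uuid4()

    def paired_record_headers(response):
        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        req_id, resp_id = record_id(), record_id()
        uri = response.request.url
        request_headers = [
            ("WARC-Type", "request"),
            ("WARC-Record-ID", req_id),
            ("WARC-Concurrent-To", resp_id),
            ("WARC-Date", now),
            ("WARC-Target-URI", uri),
            ("Content-Type", "application/http;msgtype=request"),
        ]
        response_headers = [
            ("WARC-Type", "response"),
            ("WARC-Record-ID", resp_id),
            ("WARC-Concurrent-To", req_id),
            ("WARC-Date", now),
            ("WARC-Target-URI", uri),
            ("Content-Type", "application/http;msgtype=response"),
        ]
        return request_headers, response_headers
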
[05:20] well I have some code to do basic crawling somewhere [05:20] I wrote it in an afternoon as a code sample so it's terrible [05:20] tef: i'd love to collaborate on this if you're in the mood for such a thing ;) [05:20] sure [05:20] tef: I've been meaning to get more involved with archiveteam [05:21] I was gonna just hack this shit up in warctools and push it and hope someone here find it useful to augment the warc-get bulldozers [05:21] I'm in :) [05:21] thing is, writing warcs is *relatively* straight forward [05:21] yeah [05:21] you could probably do it yourself pretty easily [05:21] yeah :) [05:22] it's basically VERSION CR LF HEADERS* CRLF BODY CRLF [05:22] if there's no aversion to dependencies, we could make it crazy awesome w/ gevent [05:22] I have to do error correcting parsing on a bunch of weird formats [05:22] i'll go to 11 :) [05:22] well I have no problem with them [05:23] but http://code.hanzoarchives.com/warc-tools/src/58d7d99406b0/hanzo/warctools/warc.py#cl-51 (btw now MIT) is how we write them [05:23] the boingboing warc is pretty much all you need beyond a warcinfo record [05:24] if you push something to git I can flesh out the warcwriting to make it standards compliant [05:24] they also like to be gzipped, record by record and then catted together [05:25] my github is tef fwiw [05:26] cos I am totally in a mood to code before bedtime, and we're at the edge of a release cycle so I have to not break work code today [05:26] awesome [05:26] I'm a member of the archiveteam org [05:26] might push something up there [05:26] we'll see [05:26] depends on my mood [05:26] i have 800 projects going on right now [05:26] heh [05:26] this one sounce nice and quick though, so i might kick it out ;) [05:27] well I am interested in making requests + warc output happen [05:27] are warcs binary? [05:27] utf8? any encoding [05:27] well, technically the body is in binary [05:27] the headers *can* be utf-8 [05:27] headers are supposed to be latin1 [05:27] but most of the values tend to be ascii anyway [05:27] grr [05:27] the headers of a warc file can be utf-8 [05:27] but the file itself [05:28] is binary [05:28] the http message is treated as an octet-stream [05:28] yes [05:28] with uf8 splashed in [05:28] gotcha [05:28] perfect [05:28] somewhat but you don't see it in practice [05:28] techincally headers can also be mime/quoted printable but I have never seen it [05:29] so yeah - warc headers/values are nominally utf-8 but it's best to write ascii - so % encode the url in Target-URI - and the http message is treated as bytes [05:30] excellent [05:30] and the newline for warc-records is CRLF [05:30] sounds quite strait forward [05:31] I have read the iso standard [05:31] well parsing it is more annoying than writing nice ones :-) [05:31] there should be a validator :) [05:31] there is warctools [05:31] i'm surprised you didn't just use HAL i think? [05:31] what is it [05:31] on pypi which has a warc2warc.py and a warcvalid.py [05:31] the one chrome uses [05:31] the chrome one has no headers [05:31] HAR [05:31] or didn't [05:31] wtf really? [05:32] this is the one heritrix produces, and people are using in practice in archives [05:32] it does [05:32] (warcs) [05:32] but it's ugly [05:32] yeah [05:32] oh i know it's standard now [05:32] warcs have a bunch of stuff for archivists (metadata, conversions, capture information) [05:33] looks like HAR doesn't have the body!? 
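
Putting tef's format notes together (version line, header lines, blank line, body, all CRLF-delimited, each record gzipped on its own and the members concatenated), here is a minimal sketch of a record writer. This is not warctools itself, just the shape:

    # Minimal sketch of one WARC record: VERSION CRLF, header lines CRLF,
    # blank CRLF, body, then two CRLFs, gzipped as its own member so a
    # .warc.gz is just records catted together.
    import gzip
    import io

    def write_record(out, headers, body):
        """headers: list of (name, value) strings; body: bytes."""
        buf = io.BytesIO()
        buf.write(b"WARC/1.0\r\n")
        for name, value in headers:
            # nominally UTF-8, but safest to keep header values ASCII
            buf.write(("%s: %s\r\n" % (name, value)).encode("ascii"))
        buf.write(("Content-Length: %d\r\n" % len(body)).encode("ascii"))
        buf.write(b"\r\n")
        buf.write(body)
        buf.write(b"\r\n\r\n")
        out.write(gzip.compress(buf.getvalue()))

Fed the header pairs from the earlier sketch plus the raw HTTP messages as bodies, and preceded by a warcinfo record, the output should be something warcvalid.py from warctools can check.
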
[05:33] there was one I saw which was essentially a http message in json to avoid having to parse http [05:33] yes [05:33] that's the one [05:34] thing is, when we're doing the compliance thing we sort of tout the whole 'you can see what went over the wire' thing [05:34] but really I didn't see the point much (also json unicode heh) [05:34] sometimes the interesting bits of http messages is the raw encoding and how it's broken [05:35] yet you don't store it transfer-encoded :) [05:35] ssssh [05:35] well we could but technically an upstream proxy is allowed to perform that transformation [05:35] and if it breaks we leave the original in [05:36] ah [05:36] fair enough [05:36] yep, that's what requests does too [05:36] awesome [05:36] this is going to be great [05:37] well if only there was a collaborative editor for python online I would be so happy [05:37] actually github does let me edit things in situ on the repo I just recalled [05:37] anyway, I am wondering what I can start writing [05:37] which might be of any help [05:38] but mostly it seems I am being an oracle for the iso warc standard [05:38] a feature wish list ;) [05:38] ok [05:44] https://notes.typo3.org/p/K2To4zZyGy [05:45] doing that with emphasis on correct iso output [05:45] ugh sometimes being a standards weenie is useful I guess [05:48] hey, requests still doesn't do a POST/GET redirect [05:48] rfc till death! [05:52] also utc, utf-8 or death [05:53] i like this guy [05:53] I have a friend in localization/internationalization [05:53] we have similar chants [05:53] including iso date times or death [05:54] unfortunately reality has a nasty way of making us deal with encoding issues [06:07] anyway, that should be enough to go on kennethre ? [06:13] https://notes.typo3.org/p/K2To4zZyGy [06:15] tef: excellent, that sound be perfect, thanks man :) [06:15] tef: I shall commence :) [06:16] cool [06:17] I was this close to trying to add it myself [06:17] well i've forked requests and checked it out [06:17] hmm [06:18] I can write somethign to do link extraction though [06:26] holy shit [06:26] so i've hacked up a crawler to use requests [06:26] three line change [06:27] https://github.com/tef/codesamples/tree/master/pyget [06:30] https://github.com/tef/crawler even (now) [06:33] haha excellent [06:33] well that's simple enough [06:35] yeah now it just needs warc output to be hacked in [06:35] thing is, as much as it seems to be duplicating warc-wget I kind appreciate it being in python as to be *configurable* [06:36] now you just need to make it report to a generic tracker and upload directly to archive.org s3 :v [06:36] s/you/me/etc/ [06:36] btw: it made it faster, scarily faster [06:37] hehe [06:37] yeah there's no reason it shouldn't be in python really [06:38] compiling warc-wget was far harder than installing a python dep [06:38] and we can script that all out [06:39] yeah [06:39] well shall i add you to crawler ? 
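
The "three line change" mentioned above is believable because the whole crawl loop is tiny; here is a toy illustration of the shape (not the code in tef's crawler repo, and the href regex is deliberately crude):

    # Toy requests-based crawl loop with naive link extraction, just to
    # show the shape being discussed; a real crawler needs robots.txt,
    # politeness delays, error handling, and so on.
    import re
    from urllib.parse import urljoin

    import requests

    LINK_RE = re.compile(r"""href=['"]?([^'" >]+)""", re.I)

    def crawl(start_url, max_pages=100):
        session = requests.Session()
        queue, seen = [start_url], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            response = session.get(url)
            # this is the point where the request/response pair would be
            # handed to the WARC-writing sketches above
            for link in LINK_RE.findall(response.text):
                queue.append(urljoin(url, link))
        return seen
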
[06:40] it's not pretty but it is enough to consume requests and write warcs [06:41] i'm just tidying up the html extractor to make it more wget-like [06:43] awesome [06:43] nah i'll make my thing [06:43] and you can use it in your crawler [06:46] awesome [06:47] thing is I could always pull in another dep on warctools, unpack the raw bits from requests and dump them into a warc [06:48] I don't mind either way [06:48] although the way that involves less effort for me is somewhat preferable :v [07:11] now I want to rewrite it entirely [07:58] kennethre: fwiw I think it might be easier for me to just write warcs [07:58] you've shown me the light with requests [08:21] ugh [08:21] h264 does not belong in avi [08:23] (just ran across a movie on IA that the "Cinepack" file (which is their silly way of saying "avi", i guess) is h264+mp3 in avi [08:24] kennethre: one minor issue. there is no request.raw :3 [08:49] well i've got it making pseudo warcs [08:50] https://github.com/tef/crawler/tree/master/scraper [08:51] but it has to reconstruct the http message from the request/response objects [09:56] hey all, not sure if anybody is interested in this, but here's some stats on the VM that i've currently got downloading MobileMe.. http://networkwhisperer.com/cacti/ [09:56] been averaging about 250 gigs up/down a day [10:30] uhoh [10:30] http://www.theinquirer.net/inquirer/news/2144705/yahoo-sheds-chairman-directors [10:31] another big change happening at the top over at yahoo [10:32] I'm axin' up axin' up, axin' up [10:32] 'cause my shareholders taught me good [10:59] lubin' up the axe [11:07] hide your flickr, hide your ... is anything left of delicious? [13:25] SketchCow: http://good.net/dl/bd [13:26] this has a lot of videos from all the hacker cons [13:26] and your bbs documentary [13:27] it also has the convention.cdroms sections [13:28] Also i was wondering [13:30] does archive.org do some sort of dedupllication? [13:31] i just think it has too since people my upload the same video more then once [13:33] I bet they do deduplication and replication [13:33] I mean, considering the amounts of data they have/get [15:42] SketchCow: is David R. Foley on your list for the arcade docu? [15:43] emijrp: let's compare http://enjoys.it/jamendo/jamendo-archive-tcs_20120126.txt (some mb text) [16:37] shite, any one got an idea what happened to klov.com? [16:38] http://www.arcade-museum.com/ is dead for me [17:30] ahahaha [17:30] http://www.nytimes.com/2012/02/08/opinion/what-wikipedia-wont-tell-you.html?_r=1 [17:31] it's cute how Cary Sherman assumes the US owns the Internet [17:31] wait, no, it's not cute [17:31] it's dangerously misguided [17:46] US owns the Internet. [17:50] US = Archive Team. Of course. [22:13] Magnet-hashes for all torrents on The Pirate Bay: 164 MB (thepiratebay.se) [22:13] http://news.ycombinator.com/item?id=3568393 [22:13] https://thepiratebay.se/torrent/7016365/The_whole_Pirate_Bay_magnet_archive [22:13] magnet:?xt=urn:btih:938802790a385c49307f34cca4c30f80b03df59c&dn=The+whole+Pirate+Bay+magnet+archive&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.ccc.de%3A80 [22:13] from the description: "The only thing that's strange is that I found out only about 1.5 millions of torrents, while there is something about 4 millions of torrents in TPB footer. 
However, I think I am correct and TPB footer is not ;)" [22:14] not knowing his collection method of the magnet links it's hard to figure out where/if he went wrong without doing my own [22:15] actual content of the torrent is compressed and is only 90 MiB btw [22:33] arrith, I've also set up a cronjob for this: http://www.archive.org/details/publicbt.com [22:33] and that's 3 millions hashes [22:34] but doesn't contain titles [22:34] Nemo_bis: ah that's pretty fancy [22:35] well, nothing special [22:35] Nemo_bis: have you checked if any hashes get removed from newer grabs? [22:35] torrent sites are a quite stupid thing after all [22:35] arrith, obviously not! [22:35] that's something I want others to do [22:35] ah [22:35] otherwise I wouldn't upload it to IA :-p [22:36] heh [22:36] in a few years, researchers will have a lot of data to work on in that item :D [22:36] indeed [22:36] Nemo_bis: is what you have already grabbed been posted on IA? [22:36] arrith, what do you mean? [22:37] Nemo_bis: i got the impression you were or were planning to upload it to IA [22:37] also, you should upload that TPB archive to IA [22:37] arrith, I've already uploaded everything [22:38] it's just the list of hashes of PBt as published on their website [22:38] Nemo_bis: ah. IA makes stuff public eventually right? has to be curated or something first though i guess [22:38] it's already public, isn't it? [22:38] unless some sysadmin has "censored" it [22:39] there's nothing to hide there, it's just a list of hashes [22:39] the uploaded IA stuff or publicbt's archives? [22:39] doesn't mean anything in itself [22:39] that item [22:39] (as they explain in their home page) [22:39] since i'm pretty sure publicbt's stuff changes, so one would need to get backdated version [22:39] ah [22:40] Nemo_bis: well i'm not getting anything for http://www.archive.org/search.php?query=all.txt.bz2 [22:40] you could use those hashes to get all torrents or all info about what they actually contain, but that's not something I'm going to do :D [22:40] I do only the easy stuff [22:41] right [22:41] arrith, I doubt you can search filenames [22:41] Nemo_bis: any idea how one might find them? [22:41] arrith, find what? [22:41] brb [22:43] http://ia700807.us.archive.org/11/items/publicbt.com/ [22:44] nice [22:44] ty DFJustin [22:44] Nemo_bis: set the item mediatype to "data" or "web" so it shows the file links [22:45] DFJustin, am I allowed to? [22:45] isn't that for privileged users [22:46] the permissions are actually available for anyone, just the web interface doesn't let you [22:47] you can using s3 http://www.archive.org/help/abouts3.txt [22:47] hm [22:47] or I just use firebug to take the readonly attribute off the textbox [22:47] :D [22:47] last time I tried, I failed [22:47] perhaps because I tried to upload to some other collecion [22:47] yeah the collection stuff is locked down [22:47] do you mean, change mediatype and leave in that collection [22:47] ah ok [22:49] DFJustin, do I have to respecify all metadata? [22:49] dunno it kind of sounds like it but I haven't tried on an existing item [22:49] heh have to tweak their page to get stuff to display. i wonder why they have it set locked down like that [22:50] which page? 
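
For the mediatype change being discussed: a hedged sketch of the abouts3.txt route, going by that doc and the exchange above. The access keys are placeholders, and the x-archive-ignore-preexisting-bucket header appears to replace the item's metadata wholesale, which is why the existing fields would have to be respecified:

    # Hedged sketch: set an existing item's mediatype over the S3-like
    # API described in abouts3.txt. ACCESS/SECRET are placeholders; any
    # other existing x-archive-meta-* fields would need to be re-sent too.
    import requests

    ITEM, ACCESS, SECRET = "publicbt.com", "ACCESS_KEY", "SECRET_KEY"

    resp = requests.put(
        "http://s3.us.archive.org/" + ITEM,
        data=b"",  # empty body, but the server still wants a Content-Length
        headers={
            "authorization": "LOW %s:%s" % (ACCESS, SECRET),
            "x-archive-ignore-preexisting-bucket": "1",
            "x-archive-meta-mediatype": "data",
            "Content-Length": "0",
        },
    )
    print(resp.status_code)
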
[22:50] for the text mediatype it makes sense since it shows the page reader interface and stuff, but if you upload something that's not a book, it doesn't convert and thus you see nothing [22:51] DFJustin, I never manage to follow the example in the doc [22:52] I get an HTML page with "A request of the requested method PUT requires a valid Content-length." [22:52] yipdw: when you're around: can you think of any better way to get all magnet links on a torrent site, say the pirate bay, without basically grabbing the full html of each page? [22:52] also anyone else with thoughts on that ^ [22:53] info_hash is often exposed on the page to let search engines index it [22:53] the collection stuff puzzles me though, they have stuff like http://www.archive.org/details/open_source_software that they don't actually allow the unwashed to upload to (although the gatekeeping is sufficiently lax to allow random metadata-less arabic stuff) [22:53] who can upload there? [22:54] I think they have to grant permission on an account-by-account basis [22:54] Nemo_bis: ah right, for other sites yeah. though i'm not sure tpb exposes the info_hash on pages. if it is it's well hidden and google can't see it since i've never gotten a tpb result for googling a userhash [22:55] I tried emailing about it a while back but got a useless form letter [22:55] heh [22:55] well, I just put tags which make sense [22:56] and when there's too much stuff to let around, someone with permissions gets sick of it and moves to the correct collection [22:56] not my problem [22:56] like, somehow this guy has the magic bits http://www.archive.org/details/homaled [22:57] ... [22:58] arrith: unless they expose it some other way, grabbing the HTML of each page is the best you can do [22:58] anyway you can just poke jason when he comes by [22:59] yipdw: yeah i guess so [22:59] I'm not familiar with what services TPB provides [22:59] re: magnet URL tracking [22:59] unfortunately [22:59] so nothing is coming to mind atm [22:59] and I don't want to access TPB from work :P [22:59] yipdw: they used to have a tracker but yeah afaik they currently serve up torrent files and list magnet data [22:59] yipdw: haha. np [23:01] the biggest torrent in that archive is the geocities one [23:01] the PATCHED one [23:04] neat [23:04] emijrp, balrog: btw seems you both were hit by a netsplit and may want to check the public logs if you're interested in magnet link archive stuff [23:05] arrith: it wasn't a netsplit here [23:05] my phone battery gave out due to cold weather :p [23:06] ouch [23:07] went from 12% to 0, like that [23:53] arrith: http://www.archiveteam.org/index.php?title=The_Pirate_Bay [23:55] emijrp: ty [23:56] time to update
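
On yipdw's point earlier that grabbing the HTML of each page is the best you can do for magnet links: a minimal sketch of that scrape, assuming the site exposes hex-encoded btih magnet URIs in its markup (the only per-site part is deciding which pages to walk):

    # Pull the info hashes out of whatever magnet links a page exposes;
    # assumes 40-character hex btih hashes, which is what TPB-style pages
    # tend to embed.
    import re

    import requests

    MAGNET_RE = re.compile(r"magnet:\?xt=urn:btih:([0-9A-Fa-f]{40})")

    def info_hashes(page_url):
        html = requests.get(page_url).text
        return set(h.lower() for h in MAGNET_RE.findall(html))
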