[00:02] Yottabyte is old news now- Hellabyte will be replacing it as a measure of really big data amounts: https://twitter.com/timoreilly/status/274198934409338880
[00:19] joepie91: i grabbed that site
[00:19] i had to put lines in a index for a good mirror but i think i got it all
[00:33] also my arstechnica.com grab is going good
[00:34] i have 2001 thur 2006
[00:34] grabing 2007
[00:35] i will do the image dumps once i'm up to 2011 grabs
[00:36] i will most likely only look for images in cdn.arstechnica.com
[03:33] underscor: are you still on my UT torrent
[07:35] i'm check url links with on computerpoweruser.com with dirbuster
[09:02] organics...
[09:10] Female organs
[12:16] godane: raided4tor or the RCT thingie?
[20:32] "DailyBooth, a social network that lets users share photos in real-time, has raised $6 million in a first round of funding."
[20:32] (from march 2011)
[20:32] ummm, ... don't all social networks do that?
[20:34] But they are more... HIP
[20:37] "The company says that to date 13 million photos have been shared via its service"
[20:37] * chronomex does math
[20:37] hmmm
[20:38] $0.45/photo ... that's more expensive than a photo lab
[21:52] http://2.bp.blogspot.com/-zQsjB4cj1cQ/UF3eLl-749I/AAAAAAAAgEE/jbiuy95wOQY/s1600/Christina+Aguilera+Cleavage+-+on+%27The+Voice%27+1.jpg
[22:43] Is there an #archiveteam-nsfw channel? :)
[22:44] is there an #archiveteam-sfw channel?
[22:50] this is the NSFW channel, the other one is the SFW channel
[23:12] There is no safe for work channel.
[23:27] SketchCow: i think we need some sort of dedup check with warc
[23:27] so it doesn't add a 88mb file like 3 times thats the same size and checksum
[23:28] In what context?
[23:29] the crypto.stanford.edu has boxes-040405.tar.bz2 file in like 4 different urls
[23:29] http://crypto.stanford.edu/cs155old/cs155-spring07/boxes-040405.tar.bz2
[23:29] I'd not be worried that much.
[23:30] http://crypto.stanford.edu/cs155old/cs155-spring06/boxes-040405.tar.bz2
[23:30] http://crypto.stanford.edu/cs155old/cs155-spring05/boxes-040405.tar.bz2
[23:30] http://crypto.stanford.edu/cs155old/cs155-spring04/boxes-040405.tar.bz2
[23:31] There must be easier ways to save much more space?
[23:31] This is an interesting debate.
[23:32] Also, this is quite an outlier thing you're going after, Godane
[23:32] i know
[23:32] You're downloading coursework and example VMs for Stanford Crypto CS courses?
[23:32] crypto.stanford.edu was not mirrored much on archive.org
[23:33] on the wayback machine
[23:33] i was just think there was pdfs and docs there
[23:34] but at least you can say you have a full mirror of it
[23:45] my thought on how the dedup would work before being add to warc is to check if there is a file of same byte size in warc
[23:46] then check if the checksum is the same before adding to warc or refering the older file in the warc for the current url
[23:47] from twitter, here's an example of a spectacular metadata fail: http://digitalnz.org/records/23254060
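
Editor's note on the dedup idea from 23:45 and 23:46: the check described there (match on byte size first, then on checksum, then either store the payload or point the current URL at the older copy) could look roughly like the sketch below. It is only an illustration, not anything from the channel or from an existing WARC tool. The `DedupIndex` class, its `check` method, and the `"response"` / `"revisit"` labels are hypothetical names chosen here; the labels borrow the WARC record type names, but the code only makes the decision and does not read or write WARC files. SHA-1 is an assumed choice of checksum.

```python
import hashlib
from typing import Dict, Optional, Tuple


class DedupIndex:
    """Hypothetical in-memory index of payloads already stored in a WARC."""

    def __init__(self) -> None:
        # byte size -> {SHA-1 hex digest -> URL of the first copy stored}
        self._by_size: Dict[int, Dict[str, str]] = {}

    def check(self, url: str, payload: bytes) -> Tuple[str, Optional[str]]:
        """Decide whether to store the payload or refer to an older copy.

        Returns ("response", None) when the payload is new, or
        ("revisit", original_url) when an identical payload was seen before.
        """
        # Step 1: narrow the search to payloads of the same byte size.
        same_size = self._by_size.setdefault(len(payload), {})
        # Step 2: confirm the match with a checksum (SHA-1 is an assumption).
        digest = hashlib.sha1(payload).hexdigest()
        original_url = same_size.get(digest)
        if original_url is not None:
            return ("revisit", original_url)
        # First time this payload is seen: remember where it came from.
        same_size[digest] = url
        return ("response", None)
```

With the four boxes-040405.tar.bz2 URLs above, and assuming the files really are byte-identical, the first call to `check` would return `"response"` and the other three would return `"revisit"` together with the first URL, so the 88mb payload would only need to be stored once.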