#archiveteam-bs 2012-11-30,Fri


Time Nickname Message
00:02 🔗 dashcloud Yottabyte is old news now- Hellabyte will be replacing it as a measure of really big data amounts: https://twitter.com/timoreilly/status/274198934409338880
00:19 🔗 godane joepie91: i grabbed that site
00:19 🔗 godane i had to put lines in an index for a good mirror but i think i got it all
00:33 🔗 godane also my arstechnica.com grab is going good
00:34 🔗 godane i have 2001 thru 2006
00:34 🔗 godane grabbing 2007
00:35 🔗 godane i will do the image dumps once i'm up to 2011 grabs
00:36 🔗 godane i will most likely only look for images in cdn.arstechnica.com
03:33 🔗 ivan` underscor: are you still on my UT torrent
07:35 🔗 godane i'm checking url links on computerpoweruser.com with dirbuster
09:02 🔗 SmileyG organics...
09:10 🔗 norbert79 Female organs
12:16 🔗 joepie91 godane: raided4tor or the RCT thingie?
20:32 🔗 chronomex "DailyBooth, a social network that lets users share photos in real-time, has raised $6 million in a first round of funding."
20:32 🔗 chronomex (from march 2011)
20:32 🔗 chronomex ummm, ... don't all social networks do that?
20:34 🔗 norbert79 But they are more... HIP
20:37 🔗 chronomex "The company says that to date 13 million photos have been shared via its service"
20:37 🔗 * chronomex does math
20:37 🔗 chronomex hmmm
20:38 🔗 chronomex $0.45/photo ... that's more expensive than a photo lab
21:52 🔗 SketchCow http://2.bp.blogspot.com/-zQsjB4cj1cQ/UF3eLl-749I/AAAAAAAAgEE/jbiuy95wOQY/s1600/Christina+Aguilera+Cleavage+-+on+%27The+Voice%27+1.jpg
22:43 🔗 swebb Is there an #archiveteam-nsfw channel? :)
22:44 🔗 chronomex is there an #archiveteam-sfw channel?
22:50 🔗 BlueMax this is the NSFW channel, the other one is the SFW channel
23:12 🔗 SketchCow There is no safe for work channel.
23:27 🔗 godane SketchCow: i think we need some sort of dedup check with warc
23:27 🔗 godane so it doesn't add an 88mb file like 3 times that's the same size and checksum
23:28 🔗 SketchCow In what context?
23:29 🔗 godane the crypto.stanford.edu has boxes-040405.tar.bz2 file in like 4 different urls
23:29 🔗 godane http://crypto.stanford.edu/cs155old/cs155-spring07/boxes-040405.tar.bz2
23:29 🔗 SketchCow I'd not be worried that much.
23:30 🔗 godane http://crypto.stanford.edu/cs155old/cs155-spring06/boxes-040405.tar.bz2
23:30 🔗 godane http://crypto.stanford.edu/cs155old/cs155-spring05/boxes-040405.tar.bz2
23:30 🔗 godane http://crypto.stanford.edu/cs155old/cs155-spring04/boxes-040405.tar.bz2
23:31 🔗 alard There must be easier ways to save much more space?
23:31 🔗 SketchCow This is an interesting debate.
23:32 🔗 SketchCow Also, this is quite an outlier thing you're going after, Godane
23:32 🔗 godane i know
23:32 🔗 SketchCow You're downloading coursework and example VMs for Stanford Crypto CS courses?
23:32 🔗 godane crypto.stanford.edu was not mirrored much on archive.org
23:33 🔗 godane on the wayback machine
23:33 🔗 godane i was just thinking there were pdfs and docs there
23:34 🔗 godane but at least you can say you have a full mirror of it
23:45 🔗 godane my thought on how the dedup would work before being added to the warc is to check if there is a file of the same byte size in the warc
23:46 🔗 godane then check if the checksum is the same before adding to the warc, or referring to the older file in the warc for the current url
23:47 🔗 dashcloud from twitter, here's an example of a spectacular metadata fail: http://digitalnz.org/records/23254060
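
The dedup check godane describes at 23:45 and 23:46 (compare byte size first, then checksum, then either store the file or refer back to the earlier copy) can be sketched roughly as below. This is only an illustration under assumptions: the `PayloadIndex` class, its `check` method, and the "store"/"refer" labels are hypothetical names, SHA-1 is an arbitrary choice of checksum, and nothing here reads or writes an actual WARC file. The sketch hashes every payload so that later files have a digest to compare against; the size lookup only narrows the set of candidates.

```python
import hashlib
from collections import defaultdict


class PayloadIndex:
    """Hypothetical index of payloads already stored, keyed by size, then digest."""

    def __init__(self):
        # byte size -> {SHA-1 hex digest -> URL the payload was first stored under}
        self._by_size = defaultdict(dict)

    def check(self, url, payload):
        """Return ("store", url) for a new payload, or ("refer", first_url) for a duplicate."""
        size = len(payload)
        # Step 1: look only at stored payloads with the same byte size.
        candidates = self._by_size.get(size)
        # Step 2: compare checksums among those candidates.
        digest = hashlib.sha1(payload).hexdigest()
        if candidates and digest in candidates:
            # Same size and same checksum: point at the earlier copy.
            return ("refer", candidates[digest])
        # New payload: remember it so later URLs can refer back to it.
        self._by_size[size][digest] = url
        return ("store", url)


# Stand-in bytes for boxes-040405.tar.bz2; the real file is about 88 MB.
tarball = b"\x42" * 4096
index = PayloadIndex()
first = index.check(
    "http://crypto.stanford.edu/cs155old/cs155-spring07/boxes-040405.tar.bz2", tarball
)
second = index.check(
    "http://crypto.stanford.edu/cs155old/cs155-spring06/boxes-040405.tar.bz2", tarball
)
print(first)
print(second)
```

In this sketch the second URL comes back as a "refer" to the first, so the payload would only need to be written once. The WARC format has a `revisit` record type intended for this kind of duplicate, though whether a given crawler emits it depends on the tool and its settings.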