[01:16] so, apparently you do need to have word: word for the warc header command
[01:18] holy crap- Angelfire is still live and changing site, and not just a repository of 90's websites
[01:19] My old angelfire is still up, kinda
[01:21] there's a split between the old style and new style (which you can still get- they sell hosting and websites still)
[01:22] apparently at some point angelfire and tripod became linked? (corporately at least)
[01:26] dashcloud: they all lync sites arnt they??
[01:26] i think thats how it was spekt, at work, cant check.
[01:27] that's certainly my hope for my archive job
[01:28] hopefully with a small number of sites, I can cover a huge amount of angelfire just by visiting every angelfire link on the page
[07:51] "balrog> does anyone know if Cameron Kaiser (of tenfourfox/classilla) is on twitter?"
[07:52] https://twitter.com/doctorlinguist , but it's protected.
[07:54] his last tweet was in 2012, so I guess he doesn't use twitter anymore
[12:14] garyrh: :/ ok
[12:14] I do follow him
[12:14] oh, he's active on ADN
[12:50] Please add to archivebot, I'm told it's going offline http://137.204.24.205/cis13b/bsco3/Default.asp
[13:02] Nemo_bis: the whole domain?
[13:04] ivan`: I'm not sure, at least that cis13b/ directory which has some irreplaceable stuff
[14:36] ivan`, Nemo_bis: is that already being taken care of?
[14:59] only partially
[15:29] ivan`: how partially? as in, is there something that needs to be done still :P
[15:52] so i think i fucked my cbsradio collection some how
[15:53] https://archive.org/details/cbsradio-hourly-2009-07-30
[15:53] no mp3 at all
[15:53] all cause i was trying to fix a typo
[15:53] :'(
[15:54] that reminds me
[15:54] godane:
[15:54] I have a -lot- of podcasts still
[15:54] from nhk
[15:54] did you ever end up fetching those?
[15:55] i don't think so
[15:55] maybe you should :P
[15:55] there on your remote sever right
[15:56] cause other wise i will only be able to get the last 7 days
[15:57] SketchCo1: are you moving my cbsradio items right now?
[15:58] cause i'm finding items that 0 files in them
[15:58] but others have files
[16:01] joepie91: maybe you should upload your collection of nhk mp3s
[16:01] they way its one less thing for me to do
[16:03] i know remember why that hourly one doesn't have mp3s
[16:03] godane: ah, you're short on time?
[16:04] i don't remember the rsync
[16:04] and yes, they're on a server of mine
[16:04] rsync://croissant.cryto.net/nhk
[16:04] they need deduplication though
[16:04] (you can tell from the last modified timestamp)
[16:04] if you don't have the time, let me know and I'll put it on my todo
[16:08] Nemo_bis, ivan`, joepie91: that directory http://137.204.24.205/cis13b/ is it being grabbed or what?
[16:19] exmic: that's what I was trying to establish
[16:19] oh
[16:19] right
[16:19] [18:12] <+ATGoKart> balrog: Your job for http://cis.alma.unibo.it/cis13b/bsco3/Default.asp has finished.
[16:19] that's /cis13b/bsco3
[16:19] that's why my ctrl+F didn't find it then
[16:19] yes
[16:19] the /cis13b/ dir doesn't have a listing
[16:19] not /cis13b as requested
[16:19] ah ok
[16:19] so unless you have a db of URLs handy...
[16:20] nope
[16:20] coming to think of it
[16:20] we could ask IA
[16:20] I should add a !ia command to archivebot
[16:20] all it does is check the URL and tell you whether or not wayback has it
[16:20] and/or is blocked by robots txt etc
[16:20] exmic: https://web.archive.org/web/*/http://cis.alma.unibo.it/cis13b/*
[16:20] (I should get back to working on it, period)
[16:20] yipdw: godo it! :P
[16:21] go do *
[16:21] I don't have time to supervise an archivebot job this week
[16:21] and be sure to make it return the last archival date
[16:21] yeah, I'll get back to it once I have less crap to do
[16:22] The most interesting stuff is in that subdir AFAIK
[16:22] Sorry for confusion
[18:13] Hi
[18:16] ohai
[19:01] guy claims to have 8tb of geocities http://www.reddit.com/r/DataHoarder/comments/27y8ux/standing_up_40tbs_of_data_for_fun_times/
[19:47] And 80 % of it is in multiple copies of the stock geocities gifs? :P
[19:49] «We don't the dedup the content in any way.» So might be.
[20:23] They DON'T the dedup? :p
[20:44] dededup?
[20:45] do not-the-dedup?
[20:45] I dedededup all my files.
[21:33] they warc and get all the dups same as archiveteam does these days
[21:33] seems like it would be a very nice dataset to pull into wayback
[21:44] "We got the archive from the archive team in the first case, so I would hope its the same"
[21:46] huh, I didn't think the geocities rip was anywhere near 8tb
[22:53] this could be me being daft, but the internetarchive python script for uploading doesnt let you specify a certain catagory? like video, web etc etc?
[22:59] --metadata="mediatype:movies" --metadata="collection:opensource_movies"
[22:59] I would assume
[23:11] yep, me being daft most likely. and sleepdeprived
[23:15] goodnight custodis pro datus, or keepers of data
[23:27] custodes pro datis
[23:31] https://maps.google.com/locationhistory/b/0
[23:31] I am horrified by what google knows about my comings and goings
[23:57] so, my laptop froze and I had to power it off, killing my ongoing wget-warc grab. If I re-run the command, will it overwrite the existing warc or create a new one?
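
Note on the 22:59 upload flags: a minimal Python sketch of how those --metadata flags could be assembled into an `ia upload` command line. The helper name, the item identifier and the file name are made up for illustration, and whether opensource_movies is the right collection is only the assumption voiced in the log. The helper builds the argument list and prints it; it does not run anything or contact archive.org.

```python
import shlex


def ia_upload_command(identifier, files, metadata):
    """Build the argv for an `ia upload` call (hypothetical helper).

    identifier: item name on archive.org (made up in the example below)
    files:      list of local file paths to upload
    metadata:   dict of key/value pairs, one --metadata flag per pair
    """
    argv = ["ia", "upload", identifier, *files]
    for key, value in metadata.items():
        # Same key:value shape as the flags quoted at 22:59.
        argv.append(f"--metadata={key}:{value}")
    return argv


if __name__ == "__main__":
    cmd = ia_upload_command(
        "my-test-item",  # hypothetical identifier
        ["clip.mp4"],    # hypothetical file
        {"mediatype": "movies", "collection": "opensource_movies"},
    )
    # Print a shell-quoted version for copy/paste; nothing is executed.
    print(shlex.join(cmd))
```

Keeping the metadata in a dict makes it easy to swap mediatype or collection per item before anything is actually uploaded.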