[02:06] Hi, anyone know about yourfanfiction.com? They did go offline. Any word from the archive team about backups? yourfanfiction.com said some time in advance that they were in danger.
[02:10] never heard of them
[03:16] OK, that's enough of nemo's stuff
[03:16] That's ONE way to spend six hours
[06:07] So..
[06:07] I cannot seem to run yahooblog-grab
[06:08] Running pipeline.py just quits without any output
[11:51] www.jizzday.com
[11:51] www.jizzday.com
[14:04] All accounts will be backed up and made available for download (actually, you can do this now, but the new backups will be offline on archive.org and available forever.)
[14:04] http://status.net/2013/01/09/preview-of-changes-to-identi-ca
[14:04] To be trusted?
[14:05] Could be a decoy to hold us off.
[14:06] Do it anyway, for the sake of doing it.
[14:06] ...unless they're in here.
[16:05] if anything has been uploaded to IA, it can be checked on, even if it is dark
[16:08] though knowing part of the identifier, or the uploader or collection, helps find it
[16:17] Coderjoe: what do you mean checked on?
[16:17] it can be looked up in the catalog
[16:18] though only admins or possibly local IA users would be able to access the files
[16:18] with wildcard search you mean
[16:18] I should use that, I guess.
[16:18] or the metamanager
[16:19] which I don't know if there is limited access to
[16:20] i think the features to do any changes require rights, but just doing queries doesn't
[16:21] changes require shell access; queries, IIRC, adminship
[16:28] i don't think shell access is needed for the changes I refer to in the metamanager
[16:28] (like moving items between collections and the like)
[16:34] hm
[16:35] i started my uploading of The Screen Savers, one per item: http://archive.org/details/The.Screen.Savers.2004.04.01
[17:05] Viewing metamgr requires you to pass User::any_admin()
[17:05] Which effectively means you need to be a collection owner
[17:05] You get (more?) buttons if you're User::slash_admin()
[17:06] which basically means you can frobnicate any item
[17:06] To view a dark item's files, though, you have to have shell access
[17:06] on the datanodes
[17:15] underscor: can you find out why new robots.txt files override older ones?
[17:15] example of an older robots.txt: http://web.archive.org/web/20040630192118/http://cetips.com/robots.txt
[17:16] example of the newer one: http://web.archive.org/web/20111004042827/http://cetips.com/robots.txt
[17:17] "oh crap. we didn't want that to be stored on IA." perhaps?
[17:18] more like the newer ones are because of that bad website sitter that blocks IA bots
[17:24] It is a policy decision to avoid legal drama.
[17:25] But, in theory, we could tie website captures to whatever state the robots.txt had at the time of the capture
[17:25] instead of the current one
[17:25] However, there are a lot of people who use the robots.txt block thinking that their stuff wouldn't show up again, so we'd need some other way to "opt out"
[17:58] godane1: the new robots.txt doesn't seem to imply that ia_archiver should be blocked at all?
[18:00] i was able to look at this site like more than a year ago
[18:00] but now i can't
[18:00] this is in the newer robots.txt: User-Agent: ia_archiver
[18:01] and a Disallow: is under that
[18:03] godane1: yes, but nothing is specified to be disallowed.
[18:04] i know that
[18:04] isn't that a bug in ia_archiver? :/
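A minimal sketch of the point being debated here, using Python's standard urllib.robotparser: the rule texts are paraphrased from the two cetips.com robots.txt captures linked above, plus a deny-everything file like the rogue one identified a few lines further down; everything else is illustrative only, not how the Wayback Machine itself evaluates robots.txt.

# Check what the quoted robots.txt rules actually permit for ia_archiver.
from urllib import robotparser

def allowed(rules, agent, url):
    # Parse a robots.txt body given as a string and ask whether the
    # named agent may fetch the given URL.
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(agent, url)

# The newer cetips.com robots.txt: an ia_archiver section with an
# empty Disallow, i.e. nothing is disallowed for that robot.
new_rules = "User-agent: ia_archiver\nDisallow:\n"

# A deny-everything file, like the rogue domainsponsor one that turns
# out to be the real cause of the block.
rogue_rules = "User-agent: *\nDisallow: /\n"

print(allowed(new_rules, "ia_archiver", "http://cetips.com/"))    # True
print(allowed(rogue_rules, "ia_archiver", "http://cetips.com/"))  # False

Per robotstxt.org, an empty Disallow under a User-Agent section means that robot may fetch everything, so the newer cetips.com file by itself should not block ia_archiver.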
[18:04] maybe
[18:04] I brought this up a week or two ago
[18:04] it sure looks like a bug
[18:05] maybe it just blocks anything if ia_archiver comes up
[18:05] yeah, but what if I specifically want to ALLOW ia_archiver for a site? (as seems to be the case here)
[18:05] http://www.robotstxt.org/robotstxt.html states that this syntax is the one "To allow a single robot"
[18:05] this is most definitely a bug ;(
[18:06] err no it isn't
[18:06] the reason it's blocked is this: http://web.archive.org/web/20120819150435/http://spi.domainsponsor.com/ds_robots.txt
[18:06] a rogue robots.txt
[18:06] that's what i thought too
[18:07] maybe archive should have a special blacklist of robots.txt files
[18:07] if something like that comes up, then just ignore it
[18:07] or apply it only to current/future crawls
[18:07] and don't black out older ones
[18:08] that would make the most sense
[18:08] that works too
[18:08] http://archive.org/post/423432/domainsponsorcom-erasing-prior-archived-copies-of-135000-domains - though jory2 derailed the thread :(
[18:10] (at the end)
[18:10] http://archive.org/post/433169/domainsponsorcom-monikercom-re-deleted-archive-after-domain-backorder etc
[18:16] maybe IA could only block if the whois did not change between the time of retrieval and the current robots.txt
[18:19] it seems like the main problem is these large-scale squatters, and that could be taken care of with a few special cases
[18:20] DFJustin: yes, this is the main problem: the large-scale squatters
[18:40] The bitsavers ingestion has hit its stride!
[18:50] Nice, the derivers were lazing again.
[18:54] MOAR FATA.
[19:27] http://archive.org/details/bitsavers is now starting to have individual companies
[21:09] hello! how do I know/limit how much disk space the warrior uses?
[21:38] I think it's under a gig, isn't it?
[21:40] The disk image can grow to up to 60 GB.
[21:41] !
[21:41] There's a way to give it more space, or less: disconnect the "data" disk from the VM, create a new virtual disk image of the size you want, and connect it.
[21:41] The warrior will format the new drive when it boots.
[21:43] (The problem with these virtual disk images is that they only grow, never shrink. So even though the warrior removes the downloaded files, the disk image will eventually grow to its full size.)
[22:41] sigh.
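To make the disk-swap procedure just described more concrete, here is a rough sketch for the VirtualBox case, driving VBoxManage from Python. The VM name "archiveteam-warrior", the controller name "SATA", and the port number are assumptions, not the appliance's actual defaults; check your own setup with VBoxManage showvminfo, and power the VM off before changing its storage.

# Rough sketch: replace the warrior's data disk with a smaller one.
# Assumed names: VM "archiveteam-warrior", controller "SATA", port 1.
import subprocess

VM = "archiveteam-warrior"   # assumed VM name
CTL = "SATA"                 # assumed storage controller name
PORT = "1"                   # assumed port of the data disk

def run(*args):
    # Echo and run a VBoxManage command, raising on failure.
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Detach the existing data disk (the VM must be powered off).
run("VBoxManage", "storageattach", VM, "--storagectl", CTL,
    "--port", PORT, "--device", "0", "--medium", "none")

# Create a smaller replacement disk; size is in MB, so 20480 is ~20 GB.
run("VBoxManage", "createmedium", "disk",
    "--filename", "warrior-data.vdi", "--size", "20480")

# Attach it; per the log above, the warrior formats it on next boot.
run("VBoxManage", "storageattach", VM, "--storagectl", CTL,
    "--port", PORT, "--device", "0", "--type", "hdd",
    "--medium", "warrior-data.vdi")

As noted in the chat, the image is dynamically allocated, so it only grows toward whatever cap you set and never shrinks on its own.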
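Going back to the 16:05 exchange about checking whether something has been uploaded to IA even when the item is dark: a minimal sketch against the public archive.org /metadata/ JSON endpoint. The identifier reused below is just the Screen Savers one mentioned at 16:35. Whether a dark item surfaces anything beyond a stub (such as an is_dark flag) through this endpoint is an assumption here; per the chat, the files themselves are only visible to admins with shell access on the datanodes.

# Look up an archive.org identifier via the public metadata endpoint.
import json
import urllib.request

def check_item(identifier):
    url = "https://archive.org/metadata/" + identifier
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    if not data:
        # The endpoint returns an empty JSON object for unknown identifiers.
        print(identifier, "-> no record found")
    elif data.get("is_dark"):
        # Assumed behaviour: dark items may expose only a stub like this.
        print(identifier, "-> exists but is dark (files not public)")
    else:
        print(identifier, "-> public,", len(data.get("files", [])), "files")

check_item("The.Screen.Savers.2004.04.01")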