[06:49] scan _all_ the internet
[07:08] Scanning
[10:48] just grabbed all videos on this playlist, hahaha oh wow
[10:48] http://www.youtube.com/watch?v=u9MpsAftCDk&list=PLAX8JHUJcFR2gh_WG3YJBITuO-tODVCcJ&index=3
[14:11] http://wwdbam.com/category/podcasts/ keeps archives but they purge old ones frequently
[14:11] so it's not really "archives"
[14:30] balrog: i'm sending it to archivebot
[14:30] godane: thing is, it's something that would need to be archived periodically :/
[14:31] i know
[14:50] perhaps archivebot should have a --scheduled flag, cc yipdw
[16:07] OK, I need help.
[16:07] ftp.sunet.se
[16:08] It's too big. I can't have FOS do the work of downloading it. Can people please team up and take pieces?
[16:29] SketchCow, try #effteepee
[17:05] what's the best way to propose a site as a new archive project?
[17:05] also:
[17:05] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[17:05] yahoosucks
[17:11] it's not a deathwatch project in the traditional sense, but important content has a tendency to go missing after a few years
[17:11] most ppl don't notice due to influx of new content
[17:11] and it's non-accessible by archive.org's crawlbot
[17:13] +email inquiries for missing content go unanswered
[17:16] GChriss: if it's not an absolutely huge site, you could get someone to check it out in #archivebot
[17:17] Don't keep us in suspense, bro
[17:18] it's moderate in all: mostly text + occasional video
[17:18] that would be the Knight News Challenge
[17:19] oh that!
[17:19] * SketchCow is on that
[17:20] things would be easier if IA's "archive this page" was a single URL, w/o "click here to archive" javascript
[17:20] there is a bookmarklet but it never really worked well for me iirc
[17:21] there's new security restrictions that limit bookmarklet functionality
[17:22] URL downloads no longer supported
[17:36] Archivebot can handle it.
[17:40] there's a "View More" button at the bottom of the entries page: can archive bot read past this? (I think not?)
[17:40] https://www.newschallenge.org/challenge/libraries/submissions/
[17:40] also no robots.txt
[17:41] I've manually submitted the ~600 projects to IA in the last round
[17:42] don't let that throw you
[17:46] probably not
[18:48] I've got a process/project underway to get as much data off FOS as possible, one of those clean-throughs I do every month or so. If you see a shitload of stuff I'm uploading, that's what it is.
[19:16] Ancestry is 2tb of love, that's going in
[19:24] Holy moly.
[19:32] Awesome SketchCow! I'm excited to see it show up in the wayback machine :)
[19:53] TONS of tiny accounts in these.
[19:53] I dropped per-item to 40gb because there's so many in each one.
[19:53] Which means lots of items.
[19:56] SketchCow: i'm doing my monthly upload cleaning too
[19:56] at least get the news collection up to date
[20:04] SketchCow: i uploaded 3 dvds of linux format the other day
[20:04] disk 186, 187, and 188
[20:07] Great
[20:40] SketchCow: there are 4 websites for ancestry: mundia.com, myfamily.com, mycanvas.com and genealogy.com/familytreemaker.genealogy.com/familyorigins.com
[20:40] mycanvas is staying (see websites)
[20:41] mundia and myfamily are going away
[20:41] genealogy has announced to make everything read-only
[20:41] so I think it would be a good idea to keep archiving everything from genealogy, since it's now read-only and no changes will be made anymore
[21:05] No arguments here.
[21:05] I'm just shoving out stuff from the buffer machine into the wayback.
[21:28] https://8chan.co/rip.txt ;_;7
[22:10] Hi! Is there a copy of the file "urls-2011-11-29-2200.tar.bz2" available? It was at http://db.tt/GNrEh61y (linked from http://archiveteam.org/index.php?title=Knol ) but is now gone. Also: the wiki page on Knol lists it as "saved", with a link to the Archives page, but I don't see any reference to it there. Thanks!
[22:28] * joepie91 looks at shortened URL and hisses
[22:31] okay, fair, it was a service-specific shortened URL
[22:31] but still.
[22:33] imo, expanded dropbox urls aren't any better than db.tt urls
[22:51] kyan: that's an older grab before we had our processes fully figured out, I checked the usual places and don't see it so I don't know where it ended up
[22:52] hopefully whoever did it is still around
[22:55] would someone be able to jog my memory? i have here a few tens of GB of hg and svn repo dumps in a directory named "~/archiveteam/oracle", timestamped around mid february 2013
[22:56] xmc: Sun panicsave, maybe?
[22:56] right, but what was it? :P
[22:56] some xen stuff
[22:56] not sure
[22:56] Oracle acquisition of Sun seems like a valid reason to me to Save All The Things
[22:56] right
[22:57] well, ok.
[22:57] I would look at my irc logs but I'm kind of doing other things
[22:58] [15:58:21] in case you aren't aware, the opensolaris website is going away soon
[22:58] [15:58:23] it needs to be archived and the Mercurial repositories do as well
[22:58] every time I want to free up space on my laptop I notice that directory, and then forget later to check where it has gone
[22:58] ok, that must be it
[22:59] looks like this stuff never made it onto IA: https://archive.org/search.php?query=opensolaris%20collection%3Aarchiveteam-fire
[22:59] I'll push it up later today when I'm at a place with better neternets
[22:59] DFJustin: Oh, oh well :(
[23:02] looks like it was http://archiveteam.org/index.php?title=User:Emijrp
[23:02] emi
[23:02] DFJustin: Thanks, i'll send them an email :)
[23:03] I think emijrp is around still intermittently
[23:03] let us know how it turns out, it needs to get reuploaded into an archive.org item in our collection
[23:10] Shot an email off to them: https://archive.org/download/mail.google.com-saved-1Oct2014/mail.google.com-saved-1Oct2014.mail
[23:13] that's a very weird thing to put on IA
[23:13] but ok
[23:14] I usually upload anything that seems like it might be of interest to anyone, correspondence, archives of websites, home videos, etc
[23:14] I really have a visceral hatred of data being discarded
[23:15] so I almost always save things. In as many places as possible.
[23:15] fair enough
[23:16] rm stuff
[23:20] xmc: weird shit makes the world go 'round
[23:20] :)
[23:20] (and then there's those fools who think it was this thing called 'money'...)
[23:21] heh
[23:21] money doesn't make the world go 'round but it is a good lubricant
[23:25] SketchCow: around?
[23:25] somebody got a "no space left on device" on IA
[23:26] that's probably Not Good
[23:26] said somebody is in this channel...
[23:26] * joepie91 stares
[23:26] it happens all the time on individual nodes I think
[23:26] suggested workaround?
[23:27] joepie91, :3
[23:27] eventually someone comes around and moves stuff off the affected node
[23:28] DFJustin: ia python module sends sizehint, does it not?
[23:28] there's plenty of space overall https://home.archive.org/~tracey/mrtg/df-week.png
[23:28] shouldn't that theoretically keep stuff like this from occurring?
[23:29] I don't know if it sends it or not but yes that is supposed to prevent it
[23:29] if the item is in the terabytes range then it may be inevitable
[23:30] 309G
[23:30] per ohhdemgir
[23:30] single tar
[23:31] I dunno how they arrange things but it's conceivable that no node would have that much free at any given time and it would just give you the least full one
[23:31] hrm.
[23:43] also, context: https://catalogd.archive.org/history/2014.09.vimeoartofnakedness
[23:45] ah, so
[23:45] the item was initially created with a txt file
[23:45] then the .tar file was attempted to be added in another operation
[23:46] the size hint thing only affects the initial item creation as that is when it picks which node to put the stuff on
[23:50] it looks like there is space on the server in question so they may have fixed it by now and it may be enough to just re-run the archive job but I'll leave that to someone who knows more
[23:52] The disk it's on only has 277gb free
[23:52] Emptying it to 320G now
[23:53] https://catalogd.archive.org/log/337363494
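
Editor's sketch of the size hint point from the 23:45-23:46 messages: if the hint only matters on the request that creates the item, it should cover every file planned for the item, including ones added in later operations. This assumes the `x-archive-size-hint` header of IA's S3-like upload API; the function names `size_hint_headers` and `plan_requests` are hypothetical and not part of the ia python module, and nothing here talks to archive.org.

```python
import os

# Header name assumed from IA's S3-like API; value is a byte count as a string.
SIZE_HINT_HEADER = "x-archive-size-hint"


def size_hint_headers(paths):
    """Build the size hint from the total size of ALL files planned for the item."""
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return {SIZE_HINT_HEADER: str(total_bytes)}


def plan_requests(paths):
    """Hypothetical helper: pair each file with the extra headers for its upload.

    Only the first request (the one that creates the item) carries the hint,
    but the hint covers every file, not just the first one.
    """
    hint = size_hint_headers(paths)
    return [(p, hint if i == 0 else {}) for i, p in enumerate(paths)]
```

In the case discussed above, that would mean the request uploading the small txt file carries a hint sized for the txt file plus the 309G tar, rather than for the txt file alone.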