[00:20] Hey guys, just a thought, but I noticed something today: archive.org now gives you the option to archive any page if you put it into the Wayback Machine and it is not already archived, and it also auto-archives various pages if they are linked to from within an archived page but not archived themselves. So would it be possible to write a script to use this to expand the archive.org archives as well?
[00:20] more or less archiving them if the option to archive them pops up
[00:21] I saw that, it's a real nice-to-have.
[00:22] Indeed, if we had had it during the isohunt bit the entire site could have been archived in a few hours
[00:23] Of course the archive would then be on archive.org servers, but you could just write a script to then go to all the pages archived there and make a copy. It would probably take longer in the long run since the pages would be archived twice, but for emergency archiving it would make for quick, easy work
[00:30] Wrong.
[00:30] I mean, nice thought, but wrong.
[00:30] 1. DDoS the archive.org server with per-page hits on sites, and it will shut you out.
[00:30] 2. Archive.org follows rules Archive Team does not.
[00:31] 3. Archive.org digs, but not comprehensively, not to the ends of the server's earth.
[00:32] They serve different purposes.
[00:33] IIRC when browsing the Wayback Machine it's not possible to click a link and have it find a matching archived page from other dates.
[01:37] it is
[01:38] ish
[01:38] replace the date in the url with *
[02:16] because of university policy moving stuff to bbvista/webct, anything under http://www.ece.drexel.edu/courses/ is unlikely to be around for more than a couple more years. no immediate delete date. IA already has most of the interesting stuff though, so i dunno if anything really needs to be done
[02:18] http://usesold.com/ <- merged to dropbox, site probably closing soon
[02:22] archivebot has grabbed usesold.com
[02:22] Lord_Nigh: do you know if /courses/ is > 40GB?
[02:23] no idea. it has no index though, that i know of
[02:23] ... ok, i'm surprised
[02:23] i never actually tried the link itself
[02:24] but it does have an index
[02:24] not sure how large it is tbh
[02:24] i don't have warc set up here properly to grab it or i would
[02:26] I started a local grab, hopefully it is small; I'll make archivebot grab it if it is
[02:27] PPT slides aren't *that* big :)
[02:35] So, unless there's a specific reason otherwise, I'd prefer if Archivebot grabbed everything.
[02:46] small = <40GB, that is small enough not to break my userscripts grab ;)
[02:49] judging by how fast this is going i doubt this is even 2gb
[03:43] it's 998MB
[16:25] SketchCow: this might not sound appropriate, but for the developer(s) working on wget/warc, jsmess, etc., is remuneration purely in the form of respect from your peers? or do people get paid at all?
[16:26] i guess being aligned with Internet Archive might have some benefits, since they are a non-profit that you can donate to?
[16:27] feel free to ignore me if that question is too obnoxious :)
[16:41] edsu: Depends
[16:42] wget is "standard" open source, so people working on that probably won't get paid unless they're employed by a Linux/Unix distributor, reseller, or consultancy
[16:43] jsmess is a port of mess, so I would guess it's similar
[16:45] I guess if you're an archiving guru, you can do a lot of free work, and then get paid for speaking at conferences, doing consulting for businesses and writing books...
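
A rough sketch of the script idea floated at 00:20, for reference: feed a list of URLs to the Wayback Machine's save feature, skipping ones that already have a snapshot. This is not an existing Archive Team tool, just an illustration; it assumes the public https://web.archive.org/save/<url> endpoint and the archive.org availability API, and the URL list is a made-up placeholder. Per the caution above, anything like this has to go slowly or archive.org will shut you out.

    import time
    import requests

    URLS = ["http://example.com/page1", "http://example.com/page2"]  # placeholder list

    def already_archived(url):
        # Ask the Wayback availability API whether a snapshot already exists.
        r = requests.get("https://archive.org/wayback/available",
                         params={"url": url}, timeout=30)
        return bool(r.json().get("archived_snapshots", {}).get("closest"))

    for url in URLS:
        if already_archived(url):
            print("already in the Wayback Machine:", url)
            continue
        # Submit the page to the Wayback Machine's save feature. Go slowly:
        # per-page hammering is exactly what gets you shut out.
        r = requests.get("https://web.archive.org/save/" + url, timeout=120)
        print(url, "->", r.status_code)
        time.sleep(10)  # arbitrary politeness delay
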
[16:45] I ain't getting squat for jsmess
[16:45] payment in the form of having better software, payment in the form of knowing that you've done something good
[16:47] people have talked about soliciting money for mame/mess but it would be a shitstorm because so much of it is building on the work of other people it's hard to decide who deserves it
[16:49] I pay BlueMax a few grand each month to annoy people and want to work for Archive Team just to drown him out.
[16:49] It's genius.
[16:50] if someone likes the apple II driver, is that attributable to lil ol me who did some js porting, r. belmont who improved the driver a lot, nate woods who worked on it in the past, wilbert pol who wrote the 6502, etc. etc.
[16:51] edsu: I'm mostly trying to understand the question.
[16:51] Is it a bog-standard "why do you all volunteer? why do you not get paid?"
[16:52] I understood it purely as "do people get paid for this?" without the "why?" ;)
[16:52] When alard was doing day-in, day-out massive amounts of Archive Team coding, I asked him if he wanted me to find some compensation funding, and he wasn't interested.
[16:53] undersco2, when he was just flushing out weeks at a time designing and hosting things, well, I got him hired at the archive
[16:53] DFJustin: Would it be the same level of shitstorm if people could reward specific developers?
[16:54] yes, because the people the public are most familiar with from blogging etc. are not necessarily doing the most work behind the scenes
[16:55] Well, bear in mind, Aaron Giles worked on Bleem and SoftPC before ever coming into contact with MAME/MESS.
[16:55] also, the whole bit about copying old software without permission is dubious enough when nobody is getting paid
[16:57] nobody wants to introduce that element into it
[16:58] ha ha
[17:01] SketchCow: i was listening to your talk at Open Source Bridge the other day, and heard "I have a developer working on wget" or something to that effect
[17:02] SketchCow: i was just wondering what that actually meant :)
[17:02] The developer was Alard
[17:02] In that talk, I ask for anyone in the audience to help with JSMESS.
[17:02] Nobody there did.
[17:02] So there's a lot of asking to get the small handful of folks.
[17:03] yeah :(
[17:03] so alard was volunteering?
[17:04] do people generally have jobs that allow them to contribute in some form?
[17:04] or at least not get in the way?
[17:04] or is it a cognitive surplus type of thing, where people have time away from work to contribute?
[17:05] just wondering if you've noticed any patterns
[17:05] sorry, i'm obv making up this question as i go ...
[17:09] for me it's currently "whenever I have time"
[17:11] It can be like any other hobby. You come home after work, then you work on this.
[17:11] for me it is: I get distracted too easily to participate in everything I am interested in.
[17:12] Obviously if you have a family or wife it gets more difficult. ;)
[17:17] Is this question for something?
[17:23] i guess if you're just firing up the warrior and letting it do its thing you don't have to worry about it too much
[17:24] but software typically takes time and effort to write; and sometimes costs money and more time/effort to keep running
[17:25] SketchCow: yeah, i think i told you yesterday i'm going to be talking about web preservation type of stuff at NDF 2013 http://www.ndf.org.nz/ and I plan on highlighting the work of archiveteam
[17:28] i also work at loc.gov, and would like to encourage people who work in libraries/archives etc. to participate more, and for managers at these institutions to support it
[17:29] mostly out of a guilty conscience for not doing it myself ...
[17:57] https://www.everpix.com/landing.html <- everpix is shutting down
[17:58] just came here to paste that :O
[17:58] Thanks.
[17:59] It's true!
[17:59] read-only until 12/15
[18:08] edsu: So, the situation is one I've spent time on. As I've now gotten to work with "professionals" and "volunteers", I generally like working with volunteers. I have had nothing but trouble with self-identified professionals.
[18:08] Volunteers who don't self-identify as professionals (that is, actual volunteers :)) I am fine with.
[18:09] I find that professionals have long moved away from doing volunteer projects in their field because their lives are filled with enough sadness just trying to get things pushed through the cheesecloth of ossified organizations.
[18:09] Archive Team provides a clear mission, purpose, and results.
[18:11] So a volunteer does not feel they have to ramp up to some internal Neverland of mores and needs.
[18:12] Occasionally, we get someone who comes in who wants to utterly upset fundamentals, but those types of folks barely last anywhere, so they either contribute or quickly move on to making raytraced Minecraft servers or mining bitcoins with a GameCube.
[18:16] People can do a small job, a big job, or a job that is easier.
[18:16] I've tried to work with people to break some of our more mundane aspects into automation.
[18:17] Hence ArchiveBot, which does the most boring thing (given a site, save it using the methods that work best with Internet Archive's Wayback Machine)
[18:17] That just kind of works.
[18:17] We're always going to need specific help with things, but absorbing people's lives day in and day out with Archive Team stuff is just going to burn people out.
[18:17] Any. Other. Questions.
[18:44] does anyone back up .ipa's from the App Store? https://itunes.apple.com/us/app/everpix/id480052550
[18:44] might as well back up every other app as well ;)
[18:44] I wish
[18:44] I hope someone is.
[19:08] SketchCow: thanks; if you have the bandwidth for more questions, i'd be curious to hear what you make of the perma.cc project
[19:09] fwiw, i agree re: volunteers, many a library/archive has been held together by volunteers
[19:16] Found this gem today:
[19:16] NOTICE: The National Dissemination Center for Children with Disabilities (NICHCY) is no longer in operation. Our funding from the U.S. Department of Education’s Office of Special Education Programs (OSEP) ended on September 30, 2013. Our website and all its free resources will remain available until September 30, 2014.
[19:17] Luckily, mama has a Digital Ocean server with wget.
[19:23] archivebot is grabbing it
[19:27] Oh, thanks! But I do have server space to do the grab, if you want to keep archivebot available for other projects...?
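
For reference, the kind of single-server wget grab mentioned above (a Digital Ocean box pointed at the closing NICHCY site) might look roughly like the sketch below. It assumes a wget built with WARC support (1.14 or later); the site URL and WARC filename are placeholders, and the flags are a guess at sensible defaults rather than ArchiveBot's actual settings.

    import subprocess

    site = "http://nichcy.org/"          # the closing site mentioned above (placeholder)
    warc_name = "nichcy.org-2013-11-05"  # hypothetical output name; wget appends .warc.gz

    subprocess.check_call([
        "wget",
        "--mirror",                 # recursive grab of the whole site
        "--page-requisites",        # also fetch images/CSS/JS needed to render pages
        "--no-parent",
        "--wait", "1",              # be polite to a site that is already on its way out
        "--warc-file", warc_name,   # write everything into a WARC as it is fetched
        "--warc-header", "operator: Archive Team volunteer",
        site,
    ])
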
[19:27] What is archivebot's capacity, in terms of number of sites grabbed and amount of disk space?
[19:28] it has 40GB of space, but completed jobs are offloaded to huge IA servers
[19:29] so that's more of a maximum running job size
[19:31] exceeding your storage capacity may lead to synaptic seepage, which will make coherent download impossible
[19:31] How is IA metadata added to the completed jobs? Are they automatically put into the "Archive Team" bucket?
[19:32] sketchcow shoves them into buckets periodically
[19:50] Do we have an archive script optimised for MediaWiki?
[19:51] Go to #wikiteam
[19:51] http://archiveteam.org/index.php?title=Wikiteam#Tools_and_source_code
[19:51] not warc though afaik
[19:52] dumping wikis using a crawler is generally a bad idea
[19:57] someone grab the frontpage of foxnews.com into warc
[19:59] nvm, wayback has a saver thinger
[19:59] I need a script at the ready for those kinds of situations. warc_page
[20:01] https://web.archive.org/web/20131105195906/http://www.foxnews.com/
[20:01] guess you got it too
[20:01] yeah
[20:02] it's gone now
[20:20] haha
[20:20] what happened to foxnews
[20:22] oh wow
[21:24] the foxnews problem is on theblaze too: http://www.theblaze.com/stories/2013/11/05/weeeeeeeeeee-foxnews-com-apparently-hacked/
[21:24] hacked? eh
[21:25] looked like a lazy intern to me
[21:25] bored, and probably now fired, intern
[21:25] “During routine website maintenance, a home page prototype was accidentally moved to the actual site. As with any mistake in testing, engineers noticed the error and quickly brought the site back to its normal function,” Chief Digital Officer Jeff Misenti said in a statement.
[21:26] ah, so that is the meaning of hacked nowadays
[21:26] why on earth would you have that as a "test" page
[21:26] * touya sighs
[21:26] it's a placeholder
[21:26] editor will choose final headline/subtitle
[21:31] shit shit shit emergency, is there a ready script/tool to rescue a site (blogspot.com blog) from google cache or similar caches?
[21:35] While investigating the impossibility of doing that, I came across a service that claims to be able to recover about 100 pages.
[21:35] hm, that's not worth it. i can do that by hand :)
[21:36] apparently google may blacklist you after only 20 hits, so you may need like 5 IPs to get 100 pages.
[21:37] reminder to self: grep for fullsize images when done, they seem to still be online
[21:38] hmm, I wonder if the Common Crawl dataset can be used for this type of thing?
[21:40] spiritt: try to see if google cache or others have a cached version of the atom.xml file of the blog?
[21:41] i am currently grabbing the archive pages, almost done, then i can relax
[21:44] heh, archive pages done, 2 pages later i get banned
[21:44] Nemo_bis, actually, good call. i have had that blog in newsbeuter since its beginnings :)
[21:45] alright, crisis reduced.
[21:45] i'm gonna reset my router to get a new ip, so until next time. cheers!
[21:46] never really tried http://warrick.cs.odu.edu/ before; i know they do something w/ content in google cache, and other web archives
[21:47] oh, spiritt is gone, like, uh, an actual spirit ...
[21:51] isn't the spirit supposed to be eternal
[21:51] Nemo_bis: details
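
Since the question at 21:31 had no ready-made answer in the channel, here is a sketch of what pulling pages back out of Google's cache might look like. It assumes the webcache.googleusercontent.com/search?q=cache:<url> pattern returns the cached copy for the pages in question; the page list is hypothetical, and, as seen above, Google bans you after a handful of hits, so it crawls very slowly and stops at the first sign of a block. Warrick (linked above) is the more battle-tested option.

    import time
    import urllib.parse
    import requests

    # Hypothetical list of known post URLs (e.g. collected from a feed reader, as above).
    PAGES = [
        "http://someblog.blogspot.com/2013/01/post-one.html",
        "http://someblog.blogspot.com/2013/02/post-two.html",
    ]

    for url in PAGES:
        cache_url = ("http://webcache.googleusercontent.com/search?q=cache:"
                     + urllib.parse.quote(url, safe=""))
        r = requests.get(cache_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        if r.status_code != 200:
            # Blocked or not cached; with a ban threshold this low it is not
            # worth retrying from the same IP.
            print("stopping at %s (HTTP %d)" % (url, r.status_code))
            break
        with open(urllib.parse.quote(url, safe="") + ".html", "w", encoding="utf-8") as f:
            f.write(r.text)
        print("saved", url)
        time.sleep(30)  # very long delay; the reported ban threshold is ~20 hits
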