#archiveteam 2013-11-05,Tue


Time Nickname Message
00:20 πŸ”— n00b762 Hey guys, just a thought, but I noticed something today: archive.org now gives you the option to archive any page if you put it into the Wayback Machine and it is not already archived, and it also auto-archives various pages if they are linked to from within an archived page but not archived themselves. So would it be possible to write a script to use this to expand the archive.org archives as well, more or less archiving
00:20 πŸ”— n00b762 them whenever the option to archive them pops up?
00:21 πŸ”— Sum1 I saw that, it's a real nice-to-have.
00:22 πŸ”— n00b762 Indeed. If we had had it during the isoHunt bit, the entire site could have been archived in a few hours
00:23 πŸ”— n00b762 Of course the archive would then be on archive.org servers, but you could just write a script to go to all the pages archived there and make a copy. It would probably take longer in the long run since the pages would be archived twice, but for emergency archiving it would make for quick, easy work
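A minimal sketch of the kind of script n00b762 is proposing (note SketchCow's caveats just below): prefixing a URL with web.archive.org/save/ triggers a Wayback capture. The file name urls.txt and the delay are assumptions.

    #!/bin/sh
    # Sketch: ask the Wayback Machine to capture each URL in urls.txt.
    # Fetching https://web.archive.org/save/<url> triggers a capture of <url>.
    while read -r url; do
        curl -s -o /dev/null "https://web.archive.org/save/$url"
        sleep 10   # be polite; per-page hammering will get you shut out
    done < urls.txt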
00:30 πŸ”— SketchCow Wrong.
00:30 πŸ”— SketchCow I mean, nice thought, but wrong.
00:30 πŸ”— SketchCow 1. DDoS the archive.org servers with per-page hits on sites, and they will shut you out.
00:30 πŸ”— SketchCow 2. Archive.org follows rules archive team does not.
00:31 πŸ”— SketchCow 3. Archive.org digs but not comprehensively, not to the ends of the server's earth.
00:32 πŸ”— SketchCow They serve different purposes.
00:33 πŸ”— Sum1 IIRC when browsing the Wayback Machine it's not possible to click a link and have it find a matching archived page from other dates.
01:37 πŸ”— xmc it is
01:38 πŸ”— xmc ish
01:38 πŸ”— xmc replace the date in the url with *
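For example, replacing the timestamp in a Wayback URL with * lists every capture of that page (example.com is a placeholder):

    https://web.archive.org/web/*/http://example.com/page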
02:16 πŸ”— Lord_Nigh because of university policy moving stuff to Blackboard Vista/WebCT, anything under http://www.ece.drexel.edu/courses/ is unlikely to be around for more than a couple more years. No immediate delete date. IA already has most of the interesting stuff though, so i dunno if anything really needs to be done
02:18 πŸ”— Lord_Nigh http://usesold.com/ <- merged to dropbox, site probably closing soon
02:22 πŸ”— ivan` archivebot has grabbed usesold.com
02:22 πŸ”— ivan` Lord_Nigh: do you know if /courses/ is > 40GB?
02:23 πŸ”— Lord_Nigh no idea. it has no index though, that i know of
02:23 πŸ”— Lord_Nigh ... ok, i'm surprised
02:23 πŸ”— Lord_Nigh i never actually tried the link itself
02:24 πŸ”— Lord_Nigh but it does have an index
02:24 πŸ”— Lord_Nigh not sure how large it is tbh
02:24 πŸ”— Lord_Nigh i don't have warc set up here properly to grab it or i would
02:26 πŸ”— ivan` I started a local grab, hopefully it is small; I'll make archivebot grab it if it is
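For reference, a local WARC grab of the sort being discussed can be done with wget 1.14 or later; the flags below are a sketch, not necessarily what ivan` actually ran:

    wget --mirror --page-requisites --no-parent \
         --warc-file=ece-drexel-courses --warc-cdx \
         "http://www.ece.drexel.edu/courses/"

--warc-file writes the crawl into ece-drexel-courses.warc.gz alongside the normal mirror, and --warc-cdx emits an index for it.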
02:27 πŸ”— phillipsj PPT slides aren't *that* big :)
02:35 πŸ”— SketchCow So, unless there's a specific reason otherwise, I'd prefer if Archivebot grabbed everything.
02:46 πŸ”— ivan` small = <40GB, that is small enough not to break my userscripts grab ;)
02:49 πŸ”— Lord_Nigh judging by how fast this is going i doubt this is even 2gb
03:43 πŸ”— ivan` it's 998MB
16:25 πŸ”— edsu SketchCow: this might not sound appropriate ; but for the developer(s) working on wget/warc, jsmess, etc, is remuneration purely in the form of respect from your peers? or do people get paid at all?
16:26 πŸ”— edsu i guess being aligned with internet archive might have some benefits, since they are a non-profit that you can donate to?
16:27 πŸ”— edsu feel free to ignore me if that question is too obnoxious :)
16:41 πŸ”— Tomcat_ edsu: Depends
16:42 πŸ”— Tomcat_ wget is "standard" open source, so people working on that probably won't get paid unless they're employed by a Linux/Unix distributor, reseller, or consultancy
16:43 πŸ”— Tomcat_ jsmess is a port of mess, so I would guess it's similar
16:45 πŸ”— Tomcat_ I guess if you're an archiving guru, you can do a lot of free work, and then get paid for speaking at conferences, doing consulting for businesses and writing books...
16:45 πŸ”— DFJustin I ain't getting squat for jsmess
16:45 πŸ”— xmc payment in the form of having better software, payment in the form of knowing that you've done something good
16:47 πŸ”— DFJustin people have talked about soliciting money for mame/mess but it would be a shitstorm, because so much of it builds on the work of other people that it's hard to decide who deserves it
16:49 πŸ”— SketchCow I pay BlueMax a few grand each month to annoy people and want to work for Archive Team just to drown him out.
16:49 πŸ”— SketchCow It's genius.
16:50 πŸ”— DFJustin if someone likes the apple II driver, is that attributable to lil ol me who did some js porting, r. belmont who improved the driver a lot, nate woods who worked on it in the past, wilbert pol who wrote the 6502, etc. etc.
16:51 πŸ”— SketchCow edsu: I'm mostly trying to understand the question.
16:51 πŸ”— SketchCow Is it a bog-standard "why do you all volunteer? why do you not get paid?"
16:52 πŸ”— Tomcat_ I understood it purely as "do people get paid for this?" without the "why?" ;)
16:52 πŸ”— SketchCow When alard was doing day-in, day-out massive amounts of Archive Team coding, I asked him if he wanted me to find some compensation funding, and he wasn't interested.
16:53 πŸ”— SketchCow undersco2, when he was just flushing out weeks at a time designing and hosting things, well, I got him hired at the archive
16:53 πŸ”— Tomcat_ DFJustin: Would it be the same level of shitstorm if people could reward specific developers?
16:54 πŸ”— DFJustin yes because the people the public are most familiar with from blogging etc. are not necessarily doing the most work behind the scenes
16:55 πŸ”— SketchCow Well, bear in mind, Aaron Giles worked on Bleem and SoftPC before ever coming into contact with MAME/MESS.
16:55 πŸ”— DFJustin also the whole bit about copying old software without permission is dubious enough when nobody is getting paid
16:57 πŸ”— DFJustin nobody wants to introduce that element into it
16:58 πŸ”— SketchCow ha ha
17:01 πŸ”— edsu SketchCow: i was listening to your talk at Open Source Bridge the other day, and heard "I have a developer working on wget" or something to that effect
17:02 πŸ”— edsu SketchCow: i was just wondering what that actually meant :)
17:02 πŸ”— SketchCow The developer was Alard
17:02 πŸ”— SketchCow In that talk, I ask for anyone in the audience to help with JSMESS.
17:02 πŸ”— SketchCow Nobody there did.
17:02 πŸ”— SketchCow So there's a lot of asking to get the small handful of folks.
17:03 πŸ”— edsu yeah :(
17:03 πŸ”— edsu so alard was volunteering?
17:04 πŸ”— edsu do people generally have jobs that allow them to contribute in some form?
17:04 πŸ”— edsu or at least not get in the way?
17:04 πŸ”— edsu or is it a cognitive surplus type of thing, where people have time away from work to contribute?
17:05 πŸ”— edsu just wondering if you've noticed any patterns
17:05 πŸ”— edsu sorry, i'm obv making up this question as i go ...
17:09 πŸ”— joepie93 for me it's currently "whenever I have time"
17:11 πŸ”— Tomcat_ It can be like any other hobby. You come home after work, then you work on this.
17:11 πŸ”— phillipsj for me it is: I get distracted too easily to participate in everything I am interested in.
17:12 πŸ”— Tomcat_ Obviously if you have a family or wife it gets more difficult. ;)
17:17 πŸ”— SketchCow Is this question for something?
17:23 πŸ”— edsu i guess if you're just firing up the warrior and letting it do its thing you don't have to worry about it too much
17:24 πŸ”— edsu but software typically takes time/effort to write ; and sometimes costs money and more time/effort to keep running
17:25 πŸ”— edsu SketchCow: yeah, i think i told you yesterday i'm going to be talking about web preservation type of stuff at NDF 2013 http://www.ndf.org.nz/ and I plan on highlighting the work of archiveteam
17:28 πŸ”— edsu i also work at loc.gov ; and would like to encourage people who work in libraries/archives etc to participate more ; and for managers at these institutions to support it
17:29 πŸ”— edsu mostly out of a guilty conscience for not doing it myself ...
17:57 πŸ”— habi https://www.everpix.com/landing.html <- everpix is shutting down
17:58 πŸ”— pft just came here to paste that :O
17:58 πŸ”— SketchCow Thanks.
17:59 πŸ”— SketchCow It's true!
17:59 πŸ”— pft read-only until 12/15
18:08 πŸ”— SketchCow edsu: So, the situation is one I've spent time on. As I've now gotten to work with "professionals" and "volunteers", I generally like working with volunteers. I have had nothing but trouble with self-identified professionals.
18:08 πŸ”— SketchCow Volunteers who don't self-identify as professionals, that is, volunteers :) I am fine with.
18:09 πŸ”— SketchCow I find that professionals have long moved away from doing volunteer projects in their field because their lives are filled with enough sadness just trying to get things pushed through the cheesecloth of ossified organizations.
18:09 πŸ”— SketchCow Archive Team provides a clear mission, purpose, and results.
18:11 πŸ”— SketchCow So a volunteer does not feel they have to ramp up to some internal Neverland of mores and needs.
18:12 πŸ”— SketchCow Occasionally, we get someone who comes in who wants to utterly upset fundamentals, but those types of folks barely last anywhere, so they either contribute or quickly move on to making raytraced Minecraft servers or mining bitcoins on a GameCube.
18:16 πŸ”— SketchCow People can do a small job, a big job, or a job that is easier.
18:16 πŸ”— SketchCow I've tried to work with people to break some of our more mundane aspects into automation.
18:17 πŸ”— SketchCow Hence the Archive Bot, which does the most boring thing (given a site, save it using the methods that work best with the Internet Archive's Wayback Machine)
18:17 πŸ”— SketchCow That just kind of works.
18:17 πŸ”— SketchCow We're always going to need specific help with things, but absorbing people's lives day in and day out with archive team stuff is just going to burn people out.
18:17 πŸ”— SketchCow Any. Other. Questions.
18:44 πŸ”— ivan` does anyone back up .ipa's from the app store? https://itunes.apple.com/us/app/everpix/id480052550
18:44 πŸ”— ivan` might as well back up every other app as well ;)
18:44 πŸ”— SketchCow I wish
18:44 πŸ”— SketchCow I hope someone is.
19:08 πŸ”— edsu SketchCow: thanks ; if you have the bandwidth for more questions, i'd be curious to hear what you make of the perma.cc project
19:09 πŸ”— edsu fwiw, i agree re: volunteers, many a library/archive has been held together by volunteers
19:16 πŸ”— Asparagir Found this gem today:
19:16 πŸ”— Asparagir NOTICE: The National Dissemination Center for Children with Disabilities (NICHCY) is no longer in operation. Our funding from the U.S. Department of Education's Office of Special Education Programs (OSEP) ended on September 30, 2013. Our website and all its free resources will remain available until September 30, 2014.
19:17 πŸ”— Asparagir Luckily, mama has a Digital Ocean server with wget.
19:23 πŸ”— ivan` archivebot is grabbing it
19:27 πŸ”— Asparagir Oh, thanks! But I do have server space to do the grab, if you want to keep archivebot available for other projects...?
19:27 πŸ”— Asparagir What is archivebot's capacity, in terms of number of sites grabbed and amount of disk space?
19:28 πŸ”— DFJustin it has 40GB of space but completed jobs are offloaded to huge IA servers
19:29 πŸ”— DFJustin so that's more of a maximum running job size
19:31 πŸ”— yipdw exceeding your storage capacity may lead to synaptic seepage, which will make coherent download impossible
19:31 πŸ”— Asparagir How is IA metadata added to the completed jobs? Are they automatically put into the "Archive Team" bucket?
19:32 πŸ”— DFJustin sketchcow shoves them into buckets periodically
19:50 πŸ”— w0rp Do we have an archive script optimised for MediaWiki?
19:51 πŸ”— SketchCow Go to #wikiteam
19:51 πŸ”— DFJustin http://archiveteam.org/index.php?title=Wikiteam#Tools_and_source_code
19:51 πŸ”— DFJustin not warc though afaik
19:52 πŸ”— balrog dumping wikis using a crawler is generally a bad idea
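The WikiTeam tooling linked above goes through the MediaWiki API rather than crawling the HTML; roughly, with dumpgenerator.py from the wikiteam repo (wiki.example.org is a placeholder):

    # dump full page histories and images via the API instead of a crawler
    python dumpgenerator.py --api=http://wiki.example.org/api.php --xml --images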
19:57 πŸ”— xmc someone grab the frontpage of foxnews.com into warc
19:59 πŸ”— xmc nvm, wayback has a saver thinger
19:59 πŸ”— w0rp I need a script at the ready for those kinds of situations. warc_page
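A minimal sketch of the warc_page script w0rp is describing, assuming wget 1.14+; the script name and flag choices are hypothetical:

    #!/bin/sh
    # warc_page: one-shot grab of a single page, plus the images/CSS/JS it
    # needs, into a WARC. Usage: warc_page URL OUTNAME
    wget --page-requisites --span-hosts \
         --warc-file="$2" --warc-cdx \
         "$1"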
20:01 πŸ”— DFJustin https://web.archive.org/web/20131105195906/http://www.foxnews.com/
20:01 πŸ”— DFJustin guess you got it too
20:01 πŸ”— xmc yeah
20:02 πŸ”— xmc it's gone now
20:20 πŸ”— yipdw haha
20:20 πŸ”— yipdw what happened to foxnews
20:22 πŸ”— yipdw oh wow
21:24 πŸ”— godane the foxnews.com problem is on TheBlaze too: http://www.theblaze.com/stories/2013/11/05/weeeeeeeeeee-foxnews-com-apparently-hacked/
21:24 πŸ”— touya hacked? eh
21:25 πŸ”— touya looked like a lazy intern to me
21:25 πŸ”— BiggieJo1 bored, and probably now fired intern
21:25 πŸ”— godane "During routine website maintenance, a home page prototype was accidentally moved to the actual site. As with any mistake in testing, engineers noticed the error and quickly brought the site back to its normal function," Chief Digital Officer Jeff Misenti said in a statement.
21:26 πŸ”— touya ah, so that is the meaning of hacked nowadays
21:26 πŸ”— BiggieJo1 why on earth would you have that as a "test" page
21:26 πŸ”— * touya sighs
21:26 πŸ”— touya it's a placeholder
21:26 πŸ”— touya editor will choose final headline/subtitle
21:31 πŸ”— spiritt shit shit shit emergency, is there a ready script/tool to rescue a site (blogspot.com blog) from google cache or similar caches?
21:35 πŸ”— phillipsj While investigating the impossibility of doing that, I came across a service that claims to be able to recover about 100 pages.
21:35 πŸ”— spiritt hm, that's not worth it. i can do that by hand :)
21:36 πŸ”— phillipsj apparently google may blacklist you after only 20 hits, so you may need like 5 IPs to get 100 pages.
21:37 πŸ”— spiritt reminder to self: grep for the full-size images when done, they seem to still be online
21:38 πŸ”— phillipsj hmm, I wonder if the common crawl dataset can be used for this type of thing?
21:40 πŸ”— Nemo_bis spiritt: try to see if google cache or others have a cached version of the atom.xml file of the blog?
21:41 πŸ”— spiritt i am currently grabbing the archive pages, almost done then i can relax
21:44 πŸ”— spiritt heh, archive pages done, 2 pages later i get banned
21:44 πŸ”— spiritt Nemo_bis, actually, good call. i've had that blog in newsbeuter since its beginnings. :)
21:45 πŸ”— spiritt alright, crisis reduced.
21:45 πŸ”— spiritt i'm gonna reset my router to get a new ip so until next time. cheers!
21:46 πŸ”— edsu never really tried http://warrick.cs.odu.edu/ before ; i know they do something w/ content in google cache, and other web archives
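Roughly what a Google-cache rescue loop looks like: webcache.googleusercontent.com serves cached copies, and per phillipsj above you need long delays (or several IPs) to stay under the ban threshold. The file urls.txt and the exact delay are assumptions:

    #!/bin/sh
    # Sketch: fetch Google's cached copy of each URL, pausing between hits
    # since Google may block an IP after ~20 requests.
    while read -r url; do
        wget -U "Mozilla/5.0" -O "cache-$(echo "$url" | tr '/:?' '___').html" \
            "https://webcache.googleusercontent.com/search?q=cache:$url"
        sleep 90
    done < urls.txt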
21:47 πŸ”— edsu oh, spiritt is gone, like, uh, an actual spirit ...
21:51 πŸ”— Nemo_bis isn't the spirit supposed to be eternal
21:51 πŸ”— edsu Nemo_bis: details
