#archiveteam 2013-11-05,Tue

Time	Nickname	Message
00:20 ^🔗	n00b762	Hey guys just a thought but I noticed something today, archive.org now gives you the option to archive any page if you put it into the waybackmachine and it is not already archived and also auto archives various pages if they are linked to from within a archived page but not archived themselves, so would it be possible to write a script to use this to expand the archive.org archives as well, more or less
00:20 ^🔗	n00b762	them if the option to archive them pops up
00:21 ^🔗	Sum1	I saw that, it's a real nice-to-have.
00:22 ^🔗	n00b762	Indeed, if we had had it during the isohunt bit the entire site could have been archived in a few hours
00:23 ^🔗	n00b762	Of course the archive would then be on archive.org servers, but you could just write a script to then go to all the pages archived there and make a copy, would probably take longer in the long run since the pages would be archived twice, but for emergency archiving it would make for quick easy work
00:30 ^🔗	SketchCow	Wrong.
00:30 ^🔗	SketchCow	I mean, nice thought, but wrong.
00:30 ^🔗	SketchCow	1. DDOS the archive.org server with per-page hits on sites, and it will shut you out.
00:30 ^🔗	SketchCow	2. Archive.org follows rules archive team does not.
00:31 ^🔗	SketchCow	3. Archive.org digs but not comprehensively, not to the ends of the server's earth.
00:32 ^🔗	SketchCow	They serve different purposes.
00:33 ^🔗	Sum1	IIRC when browsing the Wayback Machine it's not possible to click a link and have it find a matching archived page from other dates.
01:37 ^🔗	xmc	it is
01:38 ^🔗	xmc	ish
01:38 ^🔗	xmc	replace the date in the url with *
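A rough sketch of xmc's tip, assuming Wayback capture URLs have the shape https://web.archive.org/web/<timestamp>/<original-url> (the shape of the foxnews capture linked later in this log). The function name is hypothetical; it only rewrites the string and makes no network request.

```python
import re

# Assumed shape of a Wayback capture URL:
#   https://web.archive.org/web/<digits, optional modifier>/<original url>
_WAYBACK_RE = re.compile(r"^(https?://web\.archive\.org/web/)(\d{1,14}[a-z_]*)/(.+)$")


def wildcard_wayback_url(url: str) -> str:
    """Replace the timestamp segment of a Wayback capture URL with '*'."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised Wayback capture URL: {url!r}")
    prefix, _timestamp, original = match.groups()
    return f"{prefix}*/{original}"


print(wildcard_wayback_url(
    "https://web.archive.org/web/20131105195906/http://www.foxnews.com/"
))
```

The resulting URL should lead to the list of captures for that address rather than to one specific capture.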
02:16 ^🔗	Lord_Nigh	because of university policy moving stuff to bbvista/webct, anything under http://www.ece.drexel.edu/courses/ is unlikely to be around for more than a couple more years. no immediate delete date. Ia already has most of the interesting stuff though, so i dunno if anything really needs to be done
02:18 ^🔗	Lord_Nigh	http://usesold.com/ <- merged to dropbox, site probably closing soon
02:22 ^🔗	ivan`	archivebot has grabbed usesold.com
02:22 ^🔗	ivan`	Lord_Nigh: do you know if /courses/ is > 40GB?
02:23 ^🔗	Lord_Nigh	no idea. it has no index though, that i know of
02:23 ^🔗	Lord_Nigh	... ok, i'm surprised
02:23 ^🔗	Lord_Nigh	i never actually tried the link itself
02:24 ^🔗	Lord_Nigh	but it does have an index
02:24 ^🔗	Lord_Nigh	not sure how large it is tbh
02:24 ^🔗	Lord_Nigh	i don't have warc set up here properly to grab it or i would
02:26 ^🔗	ivan`	I started a local grab, hopefully it is small; I'll make archivebot grab it if it is
02:27 ^🔗	phillipsj	PPT slides aren't that big :)
02:35 ^🔗	SketchCow	So, unless there's a specific reason otherwise, I'd prefer if Archivebot grabbed everything.
02:46 ^🔗	ivan`	small = <40GB, that is small enough not to break my userscripts grab ;)
02:49 ^🔗	Lord_Nigh	judging by how fast this is going i doubt this is even 2gb
03:43 ^🔗	ivan`	it's 998MB
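For readers wondering what "having warc set up" might involve: GNU wget 1.14 and later can write WARC output through its --warc-file option. The sketch below only builds an argument list and prints it. The helper name, the output name, and the choice of flags are illustrative assumptions, not the command ivan` or ArchiveBot actually ran.

```python
import shlex


def build_wget_warc_command(url, warc_name, wait_seconds=1):
    """Build (but do not run) a wget command that mirrors a directory into a WARC."""
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--no-parent",                # stay under the starting directory
        "--page-requisites",          # also fetch images, CSS, etc. needed by pages
        f"--wait={wait_seconds}",     # pause between requests to be polite
        f"--warc-file={warc_name}",   # write a WARC alongside the mirror
        "--warc-cdx",                 # also write a CDX index for the WARC
        url,
    ]


cmd = build_wget_warc_command("http://www.ece.drexel.edu/courses/", "drexel-ece-courses")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without shell quoting problems.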
16:25 ^🔗	edsu	SketchCow: this might not sound right ; but for the developer(s) working on wget/warc, jsmess, etc, is retribution purely in the form of respect from your peers? or do people get paid at all?
16:25 ^🔗	edsu	s/right/appropriate/
16:26 ^🔗	edsu	i guess being aligned with internet archive might have some benefits, since they are a non-profit that you can donate to?
16:27 ^🔗	edsu	feel free to ignore me if that question is too obnoxious :)
16:41 ^🔗	Tomcat_	edsu: Depends
16:42 ^🔗	Tomcat_	wget is "standard" open source, so people working on that probably won't get paid, if they're not employed by a linux/unix distributor/reseller/consulting
16:43 ^🔗	Tomcat_	jsmess is a port of mess, so I would guess it's similar
16:45 ^🔗	Tomcat_	I guess if you're an archiving guru, you can do a lot of free work, and then get paid for speaking at conferences, doing consulting for businesses and writing books...
16:45 ^🔗	DFJustin	I ain't getting squat for jsmess
16:45 ^🔗	xmc	payment in the form of having better software, payment in the form of knowing that you've done something good
16:47 ^🔗	DFJustin	people have talked about soliciting money for mame/mess but it would be a shitstorm because so much of it is building on the work of other people it's hard to decide who deserves it
16:49 ^🔗	SketchCow	I pay BlueMax a few grand each month to annoy people and want to work for Archive Team just to drown him out.
16:49 ^🔗	SketchCow	It's genius.
16:50 ^🔗	DFJustin	if someone likes the apple II driver, is that attributable to lil ol me who did some js porting, r. belmont who improved the driver a lot, nate woods who worked on it in the past, wilbert pol who wrote the 6502, etc. etc.
16:51 ^🔗	SketchCow	edsu: I'm mostly trying to understand the question.
16:51 ^🔗	SketchCow	Is it a bog-standard "why do you all volunteer? why do you not get paid?"
16:52 ^🔗	Tomcat_	I understood it purely as "do people get paid for this?" without the "why?" ;)
16:52 ^🔗	SketchCow	When alard was doing day-in day-out massive amounts of archive team coding, I asked him if he wanted me to find some compensation funding, and he wasn't interested.
16:53 ^🔗	SketchCow	undersco2, when he was just flushing out weeks at a time designing and hosting things, well, I got him hired at the archive
16:53 ^🔗	Tomcat_	DFJustin: Would it be the same level of shitstorm if people could reward specific developers?
16:54 ^🔗	DFJustin	yes because the people the public are most familiar with from blogging etc. are not necessarily doing the most work behind the scenes
16:55 ^🔗	SketchCow	Well, bear in mind, Aaron Giles worked on Bleem, and SoftPC before ever coming into contact with MAME/MESS.
16:55 ^🔗	DFJustin	also the whole bit about copying old software without permission is dubious enough when nobody is getting paid
16:57 ^🔗	DFJustin	nobody wants to introduce that element into it
16:58 ^🔗	SketchCow	ha ha
17:01 ^🔗	edsu	SketchCow: i was listening to your talk at opensource bridge the other day, and heard "I have a developer working on wget" or something to that effect
17:02 ^🔗	edsu	SketchCow: i was just wondering what that actually meant :)
17:02 ^🔗	SketchCow	The developer was Alard
17:02 ^🔗	SketchCow	In that talk, I ask for anyone in the audience to help with JSMESS.
17:02 ^🔗	SketchCow	Nobody there did.
17:02 ^🔗	SketchCow	So there's a lot of asking to get the small handful of folks.
17:03 ^🔗	edsu	yeah :(
17:03 ^🔗	edsu	so alard was volunteering?
17:04 ^🔗	edsu	do people generally have jobs that allow them to contribute in some form?
17:04 ^🔗	edsu	or at least not get in the way?
17:04 ^🔗	edsu	or is it a cognitive surplus type of thing, where people have time away from work to contribute?
17:05 ^🔗	edsu	just wondering if you've noticed any patterns
17:05 ^🔗	edsu	sorry, i'm obv making up this question as i go ...
17:09 ^🔗	joepie93	for me it's currently "whenever I have time
17:09 ^🔗	joepie93	"
17:11 ^🔗	Tomcat_	It can be like any other hobby. You come home after work, then you work on this.
17:11 ^🔗	phillipsj	for me it is: I get distracted too easily to participate in everything I am interested in.
17:12 ^🔗	Tomcat_	Obviously if you have a family or wife it gets more difficult. ;)
17:17 ^🔗	SketchCow	Is this question for something?
17:23 ^🔗	edsu	i guess if you're just firing up the warrior and letting it do its thing you don't have to worry about it too much
17:24 ^🔗	edsu	but software takes time/effort typically to write ; and sometimes costs money and more time/effort to keep running
17:25 ^🔗	edsu	SketchCow: yeah, i think i told you yesterday i'm goig to be talking about web preservation type of stuff at ndf 2013 http://www.ndf.org.nz/ and I plan on highlighting the work of archiveteam
17:25 ^🔗	edsu	s/goig/going/
17:28 ^🔗	edsu	i also work at loc.gov ; and would like to encourage people who work in libraries/archives etc to participate more ; and for managers at these institutions to support it
17:29 ^🔗	edsu	mostly out of a guilty conscience for not doing it myself ...
17:57 ^🔗	habi	https://www.everpix.com/landing.html <- everpix is shutting down
17:58 ^🔗	pft	just came here to paste that :O
17:58 ^🔗	SketchCow	Thanks.
17:59 ^🔗	SketchCow	It's true!
17:59 ^🔗	pft	read-only until 12/15
18:08 ^🔗	SketchCow	edsu: So, the situation is one I've spent time on. As I've now gotten to work with "professionals" and "volunteers", I generally like working with volunteers. I have had nothing but trouble with self-identified professionals.
18:08 ^🔗	SketchCow	Volunteers who don't self-identify as professionals, that is, volunteers :) I am fine with.
18:09 ^🔗	SketchCow	I find that professionals have long moved away from doing volunteer projects in their field because their lives are filled with enough sadness just trying to get things pushed through the cheesecloth of ossified organizations.
18:09 ^🔗	SketchCow	Archive Team provides a clear mission, purpose, and results.
18:11 ^🔗	SketchCow	So a volunteer does not feel they have to ramp up to some internal Neverland of mores and needs.
18:12 ^🔗	SketchCow	Occasionally, we get someone who comes in who wants to utterly upset fundamentals, but those types of folks barely last anywhere, so they either contribute or quickly move on to making raytraced minecraft servers or mining bitcoins with a game cube.
18:16 ^🔗	SketchCow	People can do a small job, a big job, or a job that is easier.
18:16 ^🔗	SketchCow	I've tried to work with people to break some of our more mundane aspects into automation.
18:17 ^🔗	SketchCow	Hence the Archive Bot, which does the most boring thing (given a site, save it using the methods that work best with Internet Archive's wayback machine)
18:17 ^🔗	SketchCow	That just kind of works.
18:17 ^🔗	SketchCow	We're always going to need specific help with things, but absorbing people's lives day in and day out with archive team stuff is just going to burn people out.
18:17 ^🔗	SketchCow	Any. Other. Questions.
18:44 ^🔗	ivan`	does anyone back up .ipa's from the app store? https://itunes.apple.com/us/app/everpix/id480052550
18:44 ^🔗	ivan`	might as well back up every other app as well ;)
18:44 ^🔗	SketchCow	I wish
18:44 ^🔗	SketchCow	I hope someone is.
19:08 ^🔗	edsu	SketchCow: thanks ; if you have the bandwidth for more questions, i'd be curious to hear what you make of the perma.cc project
19:09 ^🔗	edsu	fwiw, i agree re: volunteers, many a library/archive has been held together by volunteers
19:16 ^🔗	Asparagir	Found this gem today:
19:16 ^🔗	Asparagir	NOTICE: The National Dissemination Center for Children with Disabilities (NICHCY) is no longer in operation. Our funding from the U.S. Department of Education’s Office of Special Education Programs (OSEP) ended on September 30, 2013. Our website and all its free resources will remain available until September 30, 2014.
19:17 ^🔗	Asparagir	Luckily, mama has a Digital Ocean server with wget.
19:23 ^🔗	ivan`	archivebot is grabbing it
19:27 ^🔗	Asparagir	Oh, thanks! But I do have server space to do the grab, if you want to keep archivebot available for other projects...?
19:27 ^🔗	Asparagir	What is archivebot's capacity, in terms of number of sites grabbed and amount of disk space?
19:28 ^🔗	DFJustin	it has 40GB of space but completed jobs are offloaded to huge IA servers
19:29 ^🔗	DFJustin	so that's more of a maximum running job size
19:31 ^🔗	yipdw	exceeding your storage capacity may lead to synaptic seepage, which will make coherent download impossible
19:31 ^🔗	Asparagir	How is IA metadata added to the completed jobs? Are they automatically put into the "Archive Team" bucket?
19:32 ^🔗	DFJustin	sketchcow shoves them into buckets periodically
19:50 ^🔗	w0rp	Do we have an archive script optimised for MediaWiki?
19:51 ^🔗	SketchCow	Go to #wikiteam
19:51 ^🔗	DFJustin	http://archiveteam.org/index.php?title=Wikiteam#Tools_and_source_code
19:51 ^🔗	DFJustin	not warc though afaik
19:52 ^🔗	balrog	dumping wikis using a crawler is generally a bad idea
19:57 ^🔗	xmc	someone grab the frontpage of foxnews.com into warc
19:59 ^🔗	xmc	nvm, wayback has a saver thinger
19:59 ^🔗	w0rp	I need a script at the ready for those kinds of situations. warc_page
20:01 ^🔗	DFJustin	https://web.archive.org/web/20131105195906/http://www.foxnews.com/
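The 14-digit segment in that capture URL reads as a YYYYMMDDhhmmss timestamp, which lines up with the 19:59 messages above. A small sketch of pulling it out, assuming the timestamp is in UTC; the function name is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def capture_time(wayback_url: str) -> datetime:
    """Extract the capture time from a /web/<timestamp>/<url> Wayback URL."""
    parts = urlparse(wayback_url).path.split("/")
    # Expected: ['', 'web', '<timestamp>', ...rest of the original URL...]
    if len(parts) < 3 or parts[1] != "web":
        raise ValueError(f"not a Wayback capture URL: {wayback_url!r}")
    digits = parts[2][:14]
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError(f"no 14-digit timestamp in: {wayback_url!r}")
    # Assumption: Wayback timestamps are UTC.
    return datetime.strptime(digits, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


print(capture_time(
    "https://web.archive.org/web/20131105195906/http://www.foxnews.com/"
).isoformat())
# prints 2013-11-05T19:59:06+00:00
```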
20:01 ^🔗	DFJustin	guess you got it too
20:01 ^🔗	xmc	yeah
20:02 ^🔗	xmc	it's gone now
20:20 ^🔗	yipdw	haha
20:20 ^🔗	yipdw	what happened to foxnews
20:22 ^🔗	yipdw	oh wow
21:24 ^🔗	godane	foxnews problem is on theblaze too: http://www.theblaze.com/stories/2013/11/05/weeeeeeeeeee-foxnews-com-apparently-hacked/
21:24 ^🔗	touya	hacked? eh
21:25 ^🔗	touya	looked like a lazy intern to me
21:25 ^🔗	BiggieJo1	bored, and probably now fired intern
21:25 ^🔗	godane	“During routine website maintenance, a home page prototype was accidentally moved to the actual site. As with any mistake in testing, engineers noticed the error and quickly brought the site back to its normal function,” Chief Digital Officer Jeff Misenti said in a statement.
21:26 ^🔗	touya	ah, so that is the meaning of hacked nowadays
21:26 ^🔗	BiggieJo1	why on earth would you have that as a "test" page
21:26 ^🔗	*	touya sighs
21:26 ^🔗	touya	it's a placeholder
21:26 ^🔗	touya	editor will choose final headline/subtitle
21:31 ^🔗	spiritt	shit shit shit emergency, is there a ready script/tool to rescue a site (blogspot.com blog) from google cache or similar caches?
21:35 ^🔗	phillipsj	While investigating the impossibility of doing that, I came across a service that claims to be able to recover about 100 pages.
21:35 ^🔗	spiritt	hm, thats not worth it. i can do that by hand :)
21:36 ^🔗	phillipsj	apparently google may blacklist you after only 20 hits, so you may need like 5 IPs to get 100 pages.
21:37 ^🔗	spiritt	reminder to self: grep fullsize images when done, they seem still online
21:38 ^🔗	phillipsj	hmm, I wonder if the common crawl dataset can be used for this type of thing?
21:40 ^🔗	Nemo_bis	spiritt: try to see if google cache or others have a cached version of the atom.xml file of the blog?
21:41 ^🔗	spiritt	i am currently grabbing the archive pages, almost done then i can relax
21:44 ^🔗	spiritt	heh, archive pages done, 2 pages later i get banned
21:44 ^🔗	spiritt	Nemo_bis, actually, good call. i have that blog in newsbeuter since its beginnings.:)
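Nemo_bis's suggestion works because an Atom feed lists each post with its title and permalink, so a saved or cached atom.xml gives an index of what to recover. A minimal sketch using only the standard library; the sample feed and its example.blogspot.com URLs are made up for illustration, and a real Blogger feed carries more fields than shown here.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RFC 4287).
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def list_entries(atom_xml: str):
    """Return (title, permalink) pairs for every entry in an Atom feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = ""
        for candidate in entry.findall("atom:link", ATOM):
            # A link with no rel attribute counts as rel="alternate".
            if candidate.get("rel", "alternate") == "alternate":
                link = candidate.get("href", "")
                break
        entries.append((title, link))
    return entries


# Hypothetical sample feed, not taken from any real blog.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>First post</title>
    <link rel="alternate" href="http://example.blogspot.com/2013/01/first-post.html"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link rel="self" href="http://example.blogspot.com/feeds/posts/default/2"/>
    <link rel="alternate" href="http://example.blogspot.com/2013/02/second-post.html"/>
  </entry>
</feed>"""

for title, link in list_entries(SAMPLE_FEED):
    print(title, link)
```

The list of permalinks can then be checked one by one against whatever caches or archives still hold the pages.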
21:45 ^🔗	spiritt	alright, crisis reduced.
21:45 ^🔗	spiritt	i'm gonna reset my router to get a new ip so until next time. cheers!
21:46 ^🔗	edsu	never really tried http://warrick.cs.odu.edu/ before ; i know they do something w/ content in google cache, and other web archives
21:47 ^🔗	edsu	oh, spiritt is gone, like, uh, an actual spirit ...
21:51 ^🔗	Nemo_bis	isn't the spirit supposed to be eternal
21:51 ^🔗	edsu	Nemo_bis: details
