[00:13] *** xk_id has quit IRC (Read error: Connection reset by peer)
[00:13] *** xk_id_ has joined #archiveteam
[00:16] *** xk_id_ has quit IRC (Remote host closed the connection)
[00:16] *** K4k has quit IRC (Read error: Operation timed out)
[00:18] *** dashcloud has quit IRC (Ping timeout: 272 seconds)
[00:28] *** dashcloud has joined #archiveteam
[01:26] *** arbin has quit IRC (Read error: Connection reset by peer)
[01:28] *** arbin has joined #archiveteam
[01:30] *** __uu has joined #archiveteam
[01:31] *** xk_id has joined #archiveteam
[01:35] *** mistym_ has joined #archiveteam
[01:40] *** __uu has quit IRC (Ping timeout: 265 seconds)
[01:42] *** mistym has quit IRC (Read error: Operation timed out)
[01:56] *** mistym_ has quit IRC (Remote host closed the connection)
[02:03] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:09] *** __uu has joined #archiveteam
[02:10] *** dashcloud has joined #archiveteam
[02:11] *** __uu_ has joined #archiveteam
[02:13] *** __uu_ has quit IRC (Client Quit)
[02:24] *** philpem has quit IRC (Ping timeout: 272 seconds)
[02:29] *** primus104 has quit IRC (Leaving.)
[03:40] *** kyan_ has joined #archiveteam
[03:40] *** godane has quit IRC (Ping timeout: 272 seconds)
[03:42] *** kyan has quit IRC (Ping timeout: 258 seconds)
[03:56] *** godane has joined #archiveteam
[03:57] *** mib_0n6by has joined #archiveteam
[03:57] Howdy - if I know of a website that is going down relatively soon, who do I talk to to possibly preserve it?
[03:58] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[03:59] us i guess
[03:59] what website is it?
[04:01] *** Lord_Nigh has joined #archiveteam
[04:01] talk in channel, that way more people can get involved if need be
[04:02] also, greetings! and thanks for showing up
[04:04] Sorry
[04:04] kb.berkeley.edu
[04:05] do you think it requires a lot of storage?
[04:05] Mostly text and a few images.
[04:05] looks like mostly text
[04:05] No large files.
[04:06] looks like a job for archiveteambot
[04:06] do you agree Ctrl-S ?
[04:06] I have no idea
[04:06] ah, ok
[04:07] https://kb.berkeley.edu/page.php?id=23247
[04:07] Hmm?
[04:07] https://kb.berkeley.edu/page.php?id=23243
[04:07] might be sequential numbering for articles?
[04:08] It is, but updates to articles are not.
[04:08] And there are a number of subsites.
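A minimal sketch of what walking sequentially numbered article IDs like those could look like, assuming the numbering really is contiguous; the ID range, filenames, and delay below are illustrative guesses, not taken from the site:

```python
import time
import urllib.error
import urllib.request

# Hypothetical ID range; the real bounds would need to be probed first.
BASE = "https://kb.berkeley.edu/page.php?id={}"

for page_id in range(23200, 23300):
    url = BASE.format(page_id)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except urllib.error.URLError:
        continue  # gaps are expected if the numbering isn't fully contiguous
    with open("kb_{}.html".format(page_id), "wb") as f:
        f.write(data)
    time.sleep(2)  # stay gentle on the server
```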
[04:08] I have added the site to a bot used for archiving
[04:09] since it's a university hosting it, might we be able to ask the admins about archiving it locally?
[04:09] mib_0n6by: do you know when it will go offline?
[04:09] Relatively soon.
[04:09] might be able to mail an HDD?
[04:09] Ctrl-S: it is a UCB page hosted by the University of Wisconsin.
[04:10] Easier to simply grab a copy as the entire site shouldn't be that large.
[04:10] I'm pretty clueless about these matters
[04:10] Trust me when I say that the site is small enough to just grab as opposed to waiting on the University to provide a copy.
[04:11] (which would be a low priority and would likely take longer than just wgetting the whole thing.)
[04:11] yeah
[04:11] a local copy is often not the best choice
[04:12] Archiving the site sooner is better than not.
[04:12] I have added the site to an archiving bot
[04:13] Thank you :)
[04:13] and thank you
[04:13] any other sites/subsites you know of that might be in need of archiving?
[04:14] From the University?
[04:14] anywhere really
[04:14] I contacted Jason Scott a while ago about a private torrent site.
[04:15] berkeley.edu is undergoing a complete site redesign soon, which means everything currently there may no longer be available or completely broken in a few months (this is for the main site only; departmental subsites are a different affair.)
[04:16] I guess that means *.berkeley.edu needs archiving
[04:16] Unknown ETA on the site change...
[04:16] *** Silent700 has left
[04:17] How do you guys handle overlap with the Archive.org WayBackMachine?
[04:17] mib_0n6by: Whaddya mean, overlap? When possible the stuff we save gets shoved on there.
[04:18] My understanding is that anything that these guys archive gets shoved onto archive.org if at all possible
[04:19] aren't we the volunteer guerrilla warriors of archive.org? Acting by ourselves, but hoping we do archive.org's bidding
[04:19] we're just a bit more aggressive/proactive at fetching stuff
[04:19] Overlap... They have their own web spiders for I guess more casual site grabs. Guess you guys pull full sites, and if it is a current / recent copy, they wouldn't have the depth nor a record of it at that time anyway.
[04:19] Ya...
[04:19] Forget you guys are a rogue branch of bad asses ;)
[04:19] :D
[04:19] They use outdated systems like robots.txt
[04:20] robots.txt are made for one thing, to be archived
[04:20] exactly
[04:20] also to point out interesting things
[04:20] Eh, robots.txt aren't "outdated". Just completely at odds with archiveteam.
[04:20] Though I suppose it's understandable archive.org listens to them; probably makes their legal standing rather less white-knuckle.
[04:20] although they sometimes point to redirect loops :\
[04:20] it was invented when robots could actually overload sites
[04:21] Robots.txt was always a sign in the road and not even a legally binding one at that.
[04:21] or break networks
[04:21] Yeah, but having an easy "just opt out" thing probably significantly reduces the random crazies.
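For reference, robots.txt is just a plain-text advisory file, and Python's standard library can evaluate one; nothing enforces its answer. A small sketch (the user-agent string and URLs are only examples):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://kb.berkeley.edu/robots.txt")
rp.read()  # fetch and parse the file

# Returns True or False; it is purely advisory, a "sign in the road".
print(rp.can_fetch("ExampleArchiveBot",
                   "https://kb.berkeley.edu/page.php?id=23247"))
```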
[04:21] It doesn't stop you guys :P
[04:21] (no accounting for insanity though.)
[04:22] mib_0n6by: Yeah, but what're they gonna do, sue a bunch of random folks?
[04:22] in a bunch of random countries
[04:22] Who may or may not be identifiable.
[04:22] Does robots.txt have any legal basis? At worst you guys are running a friendly DDoS archive attack.
[04:22] and will probably invoke the Streisand effect if bothered
[04:23] You gotta *really* piss off a big company to get that sort of wide-scatter individual lawsuit going.
[04:23] mib_0n6by: Not really, though I suspect in a court of law you could at least *argue* that a lack of robots.txt is equal to saying "hey, do whatever you want".
[04:23] it'd probably be cheaper to just give us the drives the data is on than to sue us
[04:24] When was any company actually reasonable?
[04:24] never, but they like money a whole lot
[04:24] Now, I suppose there's a chance that Yahoo! does that the next time they bring down a service.
[04:24] They don't care about things such as cultural heritage, memory and understanding history though.
[04:25] Bad PR & have to pay lawyers
[04:25] Ya... Yahoo! is still working through the bad press from shutting down geocities /sarcasm.
[04:26] >Have to pay lawyers. >PAY
[04:27] That assumes that corporations are a thinking beast that has morals, values and cares.
[04:28] Much less ones that align with you.
[04:28] they care about getting more money
[04:28] Which preserving a cultural heritage obviously allows them to collect.
[04:29] i mean there is a financial downside to lawsuits
[04:29] they don't give one shit about culture
[04:30] *** mib_0n6by has left
[04:33] *** kyan_ is now known as kyan
[04:36] !a http://www.reddit.com/r/frc/ --phantomjs
[04:36] oops
[04:41] I was going to say that archiveteam projects can be construed in the US as a violation of the CFAA if a website's ToS has anti-DoS provisions
[04:41] but the CFAA is so broad, fuck it
[04:42] I'm sure there's a way you can construe that law so that you can get arrested for typing
[04:47] I believe we'd probably not be worth suing, and the EFF would be all over the case
[04:48] Police would consider it not worth their time, since we are always careful to not overload the site
[04:49] police? do they get involved when there's a lawsuit?
[04:49] or maybe you were thinking of two different scenarios
[04:50] yes
[04:50] either a lawsuit or contacting the feds over that law
[04:51] I actually have access to a pretty good legal fund and a great lawyer if I were to be targeted... but doubt it very much
[04:58] I usually bring up the lawsuit line in a "psh who cares" fashion
[04:58] it's roughly on the same level of concern as jaywalking, and far less dangerous
[04:58] p. much
[04:59] between getting hit with Stephen Heymann or getting hit with a car I'll take Heymann
[04:59] at least you can damage Heymann
[04:59] oh right I have +o
[05:00] woop woop woop off topic siren
[05:10] *** aaaaaaaaa has quit IRC (Leaving)
[05:26] *** StartAway is now known as Start
[05:29] *** Start is now known as StartAway
[06:07] *** mistym has joined #archiveteam
[06:34] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:34] *** dashcloud has joined #archiveteam
[07:12] YEAH
[07:13] My MS-DOS thing has finished
[07:13] All the booting verified, and the script that hit the Mobygames site now does a great job
[07:30] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:34] *** brayden_ has joined #archiveteam
[07:37] *** lytv has quit IRC (Read error: Operation timed out)
[07:38] *** lytv has joined #archiveteam
[07:39] *** dashcloud has joined #archiveteam
[07:40] *** brayden has quit IRC (Read error: Operation timed out)
[07:42] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:45] *** dashcloud has joined #archiveteam
[08:26] *** primus104 has joined #archiveteam
[08:39] *** philpem has joined #archiveteam
[08:40] *** kris33 has joined #archiveteam
[09:16] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:19] *** dashcloud has joined #archiveteam
[09:47] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:16] *** mistym has quit IRC (Remote host closed the connection)
[10:24] *** kris33 has quit IRC (Textual IRC Client: www.textualapp.com)
[10:27] *** brayden_ has quit IRC (Ping timeout: 606 seconds)
[10:37] *** Swizzle_ has joined #archiveteam
[10:41] *** schbirid has joined #archiveteam
[10:44] *** Swizzle has quit IRC (Read error: Operation timed out)
[10:55] *** Control-S has joined #archiveteam
[11:03] *** Ctrl-S has quit IRC (Read error: Operation timed out)
[11:03] *** Control-S is now known as Ctrl-S
[12:03] *** Ymgve has joined #archiveteam
[12:31] *** brayden has joined #archiveteam
[13:04] *** lbft_ has quit IRC (Ping timeout: 258 seconds)
[13:21] *** lbft has joined #archiveteam
[13:56] *** bauruine has quit IRC (Ping timeout: 265 seconds)
[14:01] *** bauruine has joined #archiveteam
[14:56] *** primus105 has joined #archiveteam
[15:02] *** primus104 has quit IRC (Read error: Operation timed out)
[15:12] *** archvtyp1 has joined #archiveteam
[15:13] *** archvtype has quit IRC (Read error: Operation timed out)
[15:33] *** BiggieJon has joined #archiveteam
[15:37] *** BiggieJo1 has quit IRC (Read error: Operation timed out)
[15:41] *** ohhdemgir has quit IRC (Leaving)
[16:17] *** toad1 has joined #archiveteam
[16:24] *** toad2 has quit IRC (Ping timeout: 600 seconds)
[17:50] *** robv has joined #archiveteam
[18:05] http://vstreamers.com
[18:05] "Website will be shutting down day January 15th."
[18:06] the site looks to be a clone of old youtube
[18:06] looks like they have less than 6000 videos
[18:09] i'll get to work on the site structure
[18:10] got any ideas for an irc channel name?
[18:10] StartAway: ok, I'll start with the scripts for vstreamer
[18:11] *** StartAway is now known as Start
[18:11] 10x409 pages arkiver
[18:11] Yes
[18:11] rather small
[18:11] 21 channel pages
[18:11] midas: yeah, less than 6000 videos
[18:11] maybe we can run it through the bot?
[18:12] those videos are not linked to from the html
[18:13] probably some POST somewhere (haven't checked yet)
[18:14] oh well, it should be easy to grab
[18:14] (size wise that is)
[18:14] yeah
[18:14] I already found the videos
[18:14] should be doable
[18:17] *** intothemo has joined #archiveteam
[18:17] *** intothemo has quit IRC (Client Quit)
[18:20] would #destreamers be a good name for the irc channel?
[18:24] that would do I think
[18:27] ok
[18:40] *** nertzy has joined #archiveteam
[18:52] *** nertzy has quit IRC (This computer has gone to sleep)
[19:00] *** aaaaaaaaa has joined #archiveteam
[19:17] *** BlueMaxim has joined #archiveteam
[19:27] *** mistym has joined #archiveteam
[19:33] with vstreamers shutting down, i'd place zippcast on a watchlist
[19:34] zippcast has shut down multiple times in the past and reappeared without any content that was previously there
[19:35] *** BlueMaxim has quit IRC (Quit: Leaving)
[19:59] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:13] *** dashcloud has joined #archiveteam
[20:56] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[21:01] *** signius has quit IRC (Ping timeout: 258 seconds)
[21:05] *** dashcloud has joined #archiveteam
[21:14] Hi
[21:14] *** signius has joined #archiveteam
[21:15] can anyone help me out? I want to archive this wiki http://c2.com/cgi/wiki?PrinciplesObjectivesAndGoals
[21:16] could I get +v to try the bot on it?
[21:19] anyone have some input, suggestions?
[21:26] you can get an idea of how many links are in the wayback machine by using this link: http://web.archive.org/web/*/http://c2.com/* and there's an index of archivebot's crawls of c2.com: http://archive.fart.website/archivebot/viewer/job/xdufx
[21:28] and you can search the chat logs at http://archive.fart.website/bin/irclogger_logs to see why it was aborted
[21:29] *** ariscop has quit IRC (Ping timeout: 492 seconds)
[21:29] it looks like the log is password protected
[21:30] I'm not too interested in why it stopped the archive anyway
[21:30] I want to make an offline image/mirror of the site
[21:30] archive.org says it has 117,838 urls
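One way to check a count like that yourself is the Wayback Machine's CDX API; a rough sketch, with the query parameters as I understand them (worth double-checking against the CDX documentation):

```python
import urllib.request

# Count unique captured URLs for c2.com via the Wayback CDX API.
# collapse=urlkey should give one line per unique URL.
query = ("http://web.archive.org/cdx/search/cdx"
         "?url=c2.com&matchType=domain&fl=original&collapse=urlkey")

with urllib.request.urlopen(query) as resp:
    unique_urls = sum(1 for _ in resp)  # one response line per URL

print(unique_urls, "unique URLs captured")
```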
[21:31] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:34] oh, if you want a personal archive, you can try setting up and customizing archivebot for yourself, grab it with wget/wpull/httrack/heritrix, or ask someone else to do it
[21:34] *** dashcloud has joined #archiveteam
[21:36] there's 35k pages and it wants a delay of 30 seconds per GET. So if I got 30 people to help me we could do this in 10 hours
[21:36] that defeats the purpose of the 30s wait
[21:36] I tried on my own but the delay time was too low and it stopped giving me the pages after a bit
[21:37] http://c2.com/cgi/wiki?search=* says ~40k pages
[21:38] ah so there's a lot of pages!
[21:41] I'll email him about it again, but he ignored me before
[21:41] maybe I got spam filtered
[21:44] http://c2.com/cgi/wiki?DownloadWiki no I think he ignores me on purpose
[21:45] i'll give it a try
[21:46] > The only person who can tell you why it isn't available is its creator, WardCunningham, and he appears unwilling to do so.
[21:46] lol
[21:46] he's got a new wiki project on, so if it doesn't go well he might do something dodgy with this site to force people onto his new page
[21:46] I think it's unlikely
[21:46] I'm not judging him but I've seen other people do this
[21:48] wget is running
[21:48] schbirid: what delays?
[21:49] I'd also use random wait
[21:49] can you pause and resume wget?
[21:49] 30
[21:49] since it has many pages I was worried about that and wrote my own script
[21:49] you can ctrl-z
[21:49] ah ok cool
[21:49] there's this list of pages if you haven't seen it yet: http://c2.com/cgi/wiki?search=$
[21:50] there is also http://c2.com/cgi/wikiList
[21:50] hopefully these two have the same stuff on them
[21:50] "36855 pages found out of 36857 titles searched"
[21:50] oh nice
[21:50] * schbirid cancels
[21:51] let me see how many lines there are in the second
[21:53] eww, it has google analytics
[21:53] i am doing a wget -i on the urls
[21:53] will forget about it and find the files in 4 days or so
[21:53] good night :)
[21:53] *** schbirid has quit IRC (Leaving)
[21:54] you should grep for 'The WikiWiki Server Can not Process Your Request' every so often
[21:54] if you see this you need to wait a bit and redownload it
[21:54] brook: does it return an appropriate http response code in that case?
[21:55] i don't know
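Pulling those ideas together (the URL list, the ~30 s random wait, and re-fetching pages that come back with the "Can not Process" message), a sketch of the kind of loop involved; the input filename, retry limit, and back-off delay are arbitrary choices:

```python
import random
import time
import urllib.request

ERROR_MARKER = b"The WikiWiki Server Can not Process Your Request"

# urls.txt is assumed to hold one page URL per line, e.g. built from
# http://c2.com/cgi/wikiList.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    body = b""
    for attempt in range(3):  # arbitrary retry limit
        with urllib.request.urlopen(url, timeout=60) as resp:
            body = resp.read()
        if ERROR_MARKER not in body:
            break
        time.sleep(120)  # back off before hitting an overloaded server again
    # Note: after three failed attempts this still saves the error page.
    name = url.split("?")[-1].replace("/", "_") or "index"
    with open(name + ".html", "wb") as f:
        f.write(body)
    # Roughly what wget's --wait=30 --random-wait does (0.5x to 1.5x the wait).
    time.sleep(random.uniform(15, 45))
```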
[22:32] *** __uu has quit IRC (Ping timeout: 265 seconds)
[22:33] *** ariscop has joined #archiveteam
[22:43] *** cadbury__ has quit IRC (Read error: Operation timed out)
[22:44] http://c2.com/cgi/wiki?WikiArchive -- LOL
[22:49] *** __uu has joined #archiveteam
[23:05] SketchCow: all 2006 episodes of Believer's Voice of Victory are uploaded now
[23:11] *** __uu has quit IRC (Ping timeout: 265 seconds)
[23:17] *** __uu has joined #archiveteam
[23:41] *** __uu has quit IRC (Ping timeout: 265 seconds)
[23:43] Did someone use https://pypi.python.org/pypi/wget ?
[23:56] *** __uu has joined #archiveteam
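Regarding that last question: the PyPI wget package is, as far as I recall, a small single-file download helper, not a crawler, so it is not a substitute for GNU wget's mirroring. Typical use, from memory:

```python
import wget  # pip install wget; unrelated to GNU wget

# Downloads one file and returns the local filename.
# No recursion, no waits, no retries.
filename = wget.download("http://c2.com/cgi/wiki?WelcomeVisitors")
print(filename)
```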