#archiveteam 2015-01-04,Sun


Time Nickname Message
00:13 🔗 xk_id has quit IRC (Read error: Connection reset by peer)
00:13 🔗 xk_id_ has joined #archiveteam
00:16 🔗 xk_id_ has quit IRC (Remote host closed the connection)
00:16 🔗 K4k has quit IRC (Read error: Operation timed out)
00:18 🔗 dashcloud has quit IRC (Ping timeout: 272 seconds)
00:28 🔗 dashcloud has joined #archiveteam
01:26 🔗 arbin has quit IRC (Read error: Connection reset by peer)
01:28 🔗 arbin has joined #archiveteam
01:30 🔗 __uu has joined #archiveteam
01:31 🔗 xk_id has joined #archiveteam
01:35 🔗 mistym_ has joined #archiveteam
01:40 🔗 __uu has quit IRC (Ping timeout: 265 seconds)
01:42 🔗 mistym has quit IRC (Read error: Operation timed out)
01:56 🔗 mistym_ has quit IRC (Remote host closed the connection)
02:03 🔗 dashcloud has quit IRC (Read error: Operation timed out)
02:09 🔗 __uu has joined #archiveteam
02:10 🔗 dashcloud has joined #archiveteam
02:11 🔗 __uu_ has joined #archiveteam
02:13 🔗 __uu_ has quit IRC (Client Quit)
02:24 🔗 philpem has quit IRC (Ping timeout: 272 seconds)
02:29 🔗 primus104 has quit IRC (Leaving.)
03:40 🔗 kyan_ has joined #archiveteam
03:40 🔗 godane has quit IRC (Ping timeout: 272 seconds)
03:42 🔗 kyan has quit IRC (Ping timeout: 258 seconds)
03:56 🔗 godane has joined #archiveteam
03:57 🔗 mib_0n6by has joined #archiveteam
03:57 🔗 mib_0n6by Howdy - if I know of a website that is going down relatively soon, who do I talk to to possibly preserve it?
03:58 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
03:59 🔗 Ctrl-S us i guess
03:59 🔗 Ctrl-S what website is it?
04:01 🔗 Lord_Nigh has joined #archiveteam
04:01 🔗 Atluxity talk in channel, that way more people can get involved if need be
04:02 🔗 Atluxity also, greetings! and thanks for showing up
04:04 🔗 mib_0n6by Sorry
04:04 🔗 mib_0n6by kb.berkeley.edu
04:05 🔗 Atluxity do you think it requires a lot of storage?
04:05 🔗 mib_0n6by Mostly text and few images.
04:05 🔗 Atluxity looks like mostly text
04:05 🔗 mib_0n6by No large files.
04:06 🔗 Atluxity looks like a job for archiveteambot
04:06 🔗 Atluxity do you agree Ctrl-S ?
04:06 🔗 Ctrl-S I have no idea
04:06 🔗 Atluxity ah, ok
04:07 🔗 Ctrl-S https://kb.berkeley.edu/page.php?id=23247
04:07 🔗 mib_0n6by Hmm?
04:07 🔗 Ctrl-S https://kb.berkeley.edu/page.php?id=23243
04:07 🔗 Ctrl-S might be sequential numbering for articles?
04:08 🔗 mib_0n6by It is, but updates to articles are not.
04:08 🔗 mib_0n6by And there are a number of subsites.
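A minimal sketch of how sequentially numbered articles could be enumerated, assuming the `page.php?id=N` pattern seen in the two links above holds across the site. The ID range and the `candidate_urls` helper are hypothetical, and as noted in the conversation, article updates and subsites would not be covered by a plain ID sweep.

```python
from urllib.parse import urlencode

# URL pattern copied from the links pasted in the channel.
BASE = "https://kb.berkeley.edu/page.php"


def candidate_urls(first_id, last_id):
    """Yield one candidate article URL per ID in [first_id, last_id]."""
    for article_id in range(first_id, last_id + 1):
        yield f"{BASE}?{urlencode({'id': article_id})}"


if __name__ == "__main__":
    # Hypothetical range: just the span between the two IDs seen above.
    for url in candidate_urls(23243, 23247):
        print(url)
```

A list like this would only be a seed for a crawler; whether every ID resolves to an article is not something the log confirms.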
04:08 🔗 Atluxity I have added the site to a bot used for archiving
04:09 🔗 Ctrl-S since it's a university hosting it, might we be able to ask the admins about archiving it locally?
04:09 🔗 Atluxity mib_0n6by: do you know when it will go offline?
04:09 🔗 mib_0n6by Relatively soon.
04:09 🔗 Ctrl-S might be able to mail a HDD?
04:09 🔗 mib_0n6by Ctrl-S: it is a UCB page hosted by the University of Wisconsin.
04:10 🔗 mib_0n6by Easier to simply grab a copy as the entire site shouldn't be that large.
04:10 🔗 Ctrl-S I'm pretty clueless about these matters
04:10 🔗 mib_0n6by Trust me when I say that the site is small enough to just grab as opposed to waiting on the University to provide a copy.
04:11 🔗 mib_0n6by (which would be a low priority and would likely take longer than just wgetting the whole thing.)
04:11 🔗 Atluxity yeah
04:11 🔗 Atluxity local copy is often not the best choice
04:12 🔗 mib_0n6by Archiving the site sooner is better than not.
04:12 🔗 Atluxity I have added the site to an archiving bot
04:13 🔗 mib_0n6by Thank you :)
04:13 🔗 Atluxity and thank you
04:13 🔗 Ctrl-S any other sites/subsites you know of that might be in need of archiving?
04:14 🔗 mib_0n6by From the University?
04:14 🔗 Ctrl-S anywhere really
04:14 🔗 mib_0n6by I contacted Jason Scott a while ago about a private torrent site.
04:15 🔗 mib_0n6by berkeley.edu is undergoing a complete site redesign soon, which means everything currently there may no longer be available or completely broken in a few months (this is for the main site only, departmental subsites are a different affair.)
04:16 🔗 Ctrl-S I guess that means *.berkeley needs archiving
04:16 🔗 mib_0n6by Unknown ETA on the site change...
04:16 🔗 Silent700 has left
04:17 🔗 mib_0n6by How do you guys handle overlap with the Archive.org WayBackMachine?
04:17 🔗 pikhq_ mib_0n6by: Whaddya mean, overlap? When possible the stuff we save gets shoved on there.
04:18 🔗 Ctrl-S My understanding is that anything that these guys archive gets shoved onto archive.org if at all possible
04:19 🔗 Atluxity aren't we the volunteer guerrilla warriors of archive.org? Acting by ourselves, but hoping we do archive.org's bidding
04:19 🔗 Ctrl-S we're just a bit more aggressive/proactive at fetching stuff
04:19 🔗 mib_0n6by Overlap... They have their own web spiders for I guess more casual site grabs. Guess you guys pull full sites and if it is a current / recent copy, they wouldn't have the depth nor record of it at that time anyway.
04:19 🔗 mib_0n6by Ya...
04:19 🔗 mib_0n6by Forget you guys are a rogue branch of bad asses ;)
04:19 🔗 Atluxity :D
04:19 🔗 Ctrl-S They use outdated systems like robots.txt
04:20 🔗 Atluxity robots.txt are made for one thing, to be archived
04:20 🔗 Ctrl-S exactly
04:20 🔗 Ctrl-S also to point out interesting things
04:20 🔗 pikhq_ Eh, robots.txt aren't "outdated". Just completely at odds with archiveteam.
04:20 🔗 pikhq_ Though I suppose understandable archive.org listens to them; probably makes their legal standing rather less white-knuckle.
04:20 🔗 Atluxity although they sometimes point to redirect loops :\
04:20 🔗 Ctrl-S it was invented when robots could actually overload sites
04:21 🔗 mib_0n6by Robots.txt was always a sign in the road and not even a legally binding one at that.
04:21 🔗 Ctrl-S or break networks
04:21 🔗 pikhq_ Yeah, but having an easy "just opt out" thing probably significantly reduces the random crazies.
04:21 🔗 mib_0n6by It doesn't stop you guys :P
04:21 🔗 pikhq_ (no accounting for insanity though.)
04:22 🔗 pikhq_ mib_0n6by: Yeah, but what're they gonna do, sue a bunch of random folks?
04:22 🔗 Ctrl-S in a bunch of random countries
04:22 🔗 pikhq_ Who may or may not be identifiable.
04:22 🔗 mib_0n6by Does robots.txt have any legal basis? At worst you guys are running a friendly DDOS archive attack.
04:22 🔗 Ctrl-S and will probably invoke the Streisand effect if bothered
04:23 🔗 pikhq_ You gotta *really* piss off a big company to get that sort of wide-scatter individual lawsuit going.
04:23 🔗 pikhq_ mib_0n6by: Not really, though I suspect in a court of law you could at least *argue* that a lack of robots.txt is equal to saying "hey, do whatever you want".
04:23 🔗 Ctrl-S it'd probably be cheaper to just give us the drives the data is on than to sue us
04:24 🔗 mib_0n6by When was any company actually reasonable?
04:24 🔗 Ctrl-S never, but they like money a whole lot
04:24 🔗 pikhq_ Now, I suppose there's a chance that Yahoo! does that the next time they bring down a service.
04:24 🔗 mib_0n6by They don't care about things such as cultural heritage, memory and understanding history though.
04:25 🔗 Ctrl-S Bad PR&have to pay lawyers
04:25 🔗 mib_0n6by Ya... Yahoo! is still working through the bad press from shutting down geocities /sarcasm.
04:26 🔗 Ctrl-S >Have to pay lawyers. >PAY
04:27 🔗 mib_0n6by That assumes that corporations are a thinking beast that have morals, values and cares.
04:28 🔗 mib_0n6by Much less ones that align with you.
04:28 🔗 Ctrl-S they care about getting more money
04:28 🔗 mib_0n6by Which preserving a cultural heritage obviously allows them to collect.
04:29 🔗 Ctrl-S i mean there is a financial downside to lawsuits
04:29 🔗 Ctrl-S they don't give one shit about culture
04:30 🔗 mib_0n6by has left
04:33 🔗 kyan_ is now known as kyan
04:36 🔗 balrog !a http://www.reddit.com/r/frc/ --phantomjs
04:36 🔗 balrog oops
04:41 🔗 yipdw I was going to say that archiveteam projects can be construed in the US as a violation of the CFAA if a website's ToS has anti-DoS provisions
04:41 🔗 yipdw but the CFAA is so broad, fuck it
04:42 🔗 yipdw I'm sure there's a way you can construe that law so that you can get arrested for typing
04:47 🔗 Ctrl-S I believe we'd probably not be worth suing, and the EFF would be all over the case
04:48 🔗 Ctrl-S Police would consider it not worth their time, since we are always careful to not overload the site
04:49 🔗 Atluxity police? do they get involved in a lawsuit?
04:49 🔗 Atluxity or maybe you were thinking of two different scenarios
04:50 🔗 Ctrl-S yes
04:50 🔗 Ctrl-S either a lawsuit or contacting the feds over that law
04:51 🔗 Atluxity I actually have access to a pretty good legal fund and a great lawyer if I was to be targeted... but doubt it very much
04:58 🔗 yipdw I usually bring up the lawsuit line in a "psh who cares" fashion
04:58 🔗 yipdw it's roughly on the same level of concern as jaywalking, and far less dangerous
04:58 🔗 Ctrl-S p. much
04:59 🔗 yipdw between getting hit with Stephen Heymann or getting hit with a car I'll take Heymann
04:59 🔗 yipdw at least you can damage Heymann
04:59 🔗 yipdw oh right I have +o
05:00 🔗 yipdw woop woop woop off topic siren
05:10 🔗 aaaaaaaaa has quit IRC (Leaving)
05:26 🔗 StartAway is now known as Start
05:29 🔗 Start is now known as StartAway
06:07 🔗 mistym has joined #archiveteam
06:34 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:34 🔗 dashcloud has joined #archiveteam
07:12 🔗 SketchCow YEAH
07:13 🔗 SketchCow My MS-DOS thing has finished
07:13 🔗 SketchCow All the booting verified, and the script that hit the Mobygames site now does a great job
07:30 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:34 🔗 brayden_ has joined #archiveteam
07:37 🔗 lytv has quit IRC (Read error: Operation timed out)
07:38 🔗 lytv has joined #archiveteam
07:39 🔗 dashcloud has joined #archiveteam
07:40 🔗 brayden has quit IRC (Read error: Operation timed out)
07:42 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:45 🔗 dashcloud has joined #archiveteam
08:26 🔗 primus104 has joined #archiveteam
08:39 🔗 philpem has joined #archiveteam
08:40 🔗 kris33 has joined #archiveteam
09:16 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:19 🔗 dashcloud has joined #archiveteam
09:47 🔗 BlueMaxim has quit IRC (Quit: Leaving)
10:16 🔗 mistym has quit IRC (Remote host closed the connection)
10:24 🔗 kris33 has quit IRC (Textual IRC Client: www.textualapp.com)
10:27 🔗 brayden_ has quit IRC (Ping timeout: 606 seconds)
10:37 🔗 Swizzle_ has joined #archiveteam
10:41 🔗 schbirid has joined #archiveteam
10:44 🔗 Swizzle has quit IRC (Read error: Operation timed out)
10:55 🔗 Control-S has joined #archiveteam
11:03 🔗 Ctrl-S has quit IRC (Read error: Operation timed out)
11:03 🔗 Control-S is now known as Ctrl-S
12:03 🔗 Ymgve has joined #archiveteam
12:31 🔗 brayden has joined #archiveteam
13:04 🔗 lbft_ has quit IRC (Ping timeout: 258 seconds)
13:21 🔗 lbft has joined #archiveteam
13:56 🔗 bauruine has quit IRC (Ping timeout: 265 seconds)
14:01 🔗 bauruine has joined #archiveteam
14:56 🔗 primus105 has joined #archiveteam
15:02 🔗 primus104 has quit IRC (Read error: Operation timed out)
15:12 🔗 archvtyp1 has joined #archiveteam
15:13 🔗 archvtype has quit IRC (Read error: Operation timed out)
15:33 🔗 BiggieJon has joined #archiveteam
15:37 🔗 BiggieJo1 has quit IRC (Read error: Operation timed out)
15:41 🔗 ohhdemgir has quit IRC (Leaving)
16:17 🔗 toad1 has joined #archiveteam
16:24 🔗 toad2 has quit IRC (Ping timeout: 600 seconds)
17:50 🔗 robv has joined #archiveteam
18:05 🔗 StartAway http://vstreamers.com
18:05 🔗 StartAway "Website will be shutting down day January 15th."
18:06 🔗 StartAway the site looks to be a clone of old youtube
18:06 🔗 arkiver looks like they have less than 6000 videos
18:09 🔗 StartAway i'll get to work on the site structure
18:10 🔗 StartAway got any ideas for an irc channel name?
18:10 🔗 arkiver StartAway: ok, I'll start with the scripts for vstreamers
18:11 🔗 StartAway is now known as Start
18:11 🔗 midas 10x409 pages arkiver
18:11 🔗 arkiver Yes
18:11 🔗 midas rather small
18:11 🔗 arkiver 21 channel pages
18:11 🔗 arkiver midas: yeah, less than 6000 videos
18:11 🔗 midas maybe we can run it through the bot?
18:12 🔗 arkiver those videos are not linked to from the html
18:13 🔗 arkiver probably some post somewhere (haven't checked yet)
18:14 🔗 midas oh well, it should be easy to grab
18:14 🔗 midas (size wise that is)
18:14 🔗 arkiver yeah
18:14 🔗 arkiver I already found the videos
18:14 🔗 arkiver should be doable
18:17 🔗 intothemo has joined #archiveteam
18:17 🔗 intothemo has quit IRC (Client Quit)
18:20 🔗 Start would #destreamers be a good name for the irc channel?
18:24 🔗 arkiver that would do I think
18:27 🔗 Start ok
18:40 🔗 nertzy has joined #archiveteam
18:52 🔗 nertzy has quit IRC (This computer has gone to sleep)
19:00 🔗 aaaaaaaaa has joined #archiveteam
19:17 🔗 BlueMaxim has joined #archiveteam
19:27 🔗 mistym has joined #archiveteam
19:33 🔗 Start with vstreamers shutting down, i'd place zippcast on a watchlist
19:34 🔗 Start zippcast has shut down multiple times in the past and reappeared without any content that was previously there
19:35 🔗 BlueMaxim has quit IRC (Quit: Leaving)
19:59 🔗 dashcloud has quit IRC (Read error: Operation timed out)
20:13 🔗 dashcloud has joined #archiveteam
20:56 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
21:01 🔗 signius has quit IRC (Ping timeout: 258 seconds)
21:05 🔗 dashcloud has joined #archiveteam
21:14 🔗 brook Hi
21:14 🔗 signius has joined #archiveteam
21:15 🔗 brook can anyone help me out? I want to archive this wiki http://c2.com/cgi/wiki?PrinciplesObjectivesAndGoals
21:16 🔗 brook could I get +v to try the bot on it?
21:19 🔗 brook anyone have some input, suggestions?
21:26 🔗 chfoo you can get an idea of how many links are in the wayback machine by using this link: http://web.archive.org/web/*/http://c2.com/* and there's an index of archivebot's crawls of c2.com: http://archive.fart.website/archivebot/viewer/job/xdufx
21:28 🔗 chfoo and you can search the chat logs at http://archive.fart.website/bin/irclogger_logs to see why it was aborted
21:29 🔗 ariscop has quit IRC (Ping timeout: 492 seconds)
21:29 🔗 brook it looks like the log is password protected
21:30 🔗 brook I'm not too interested in why it stopped the archive anyway
21:30 🔗 brook I want to make an offline image/mirror of the site
21:30 🔗 brook archive.org says it has 117,838 urls
21:31 🔗 dashcloud has quit IRC (Read error: Operation timed out)
21:34 🔗 chfoo oh, if you want a personal archive, you can try setting up and customizing archivebot for yourself, grab it with wget/wpull/httrack/heritrix, or ask someone else to do it
21:34 🔗 dashcloud has joined #archiveteam
21:36 🔗 brook there's 35k pages and it wants a delay time of 30 seconds per get. So if I got 30 people to help me we could do this in 10 hours
21:36 🔗 schbirid that defeats the purpose of the 30s wait
21:36 🔗 brook I tried on my own but the delay time was too low and it stopped giving me the pages after a bit
21:37 🔗 schbirid http://c2.com/cgi/wiki?search=* says ~40k pages
21:38 🔗 brook ah so there's a lot of pages!
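The estimate above can be checked with a little arithmetic. This is only a sketch: the `crawl_hours` helper is hypothetical, it assumes every request costs exactly the stated delay and nothing else, and it uses the page counts quoted in the channel (35k and roughly 40k).

```python
def crawl_hours(pages, delay_seconds, clients=1):
    """Hours needed if each client fetches one page per delay_seconds."""
    return pages * delay_seconds / clients / 3600


if __name__ == "__main__":
    # 35k pages, 30 s delay, split across 30 people.
    print(f"{crawl_hours(35_000, 30, clients=30):.1f} h")  # prints "9.7 h"
    # The same job from a single client.
    print(f"{crawl_hours(35_000, 30):.1f} h")              # prints "291.7 h"
    # ~40k pages from a single client.
    print(f"{crawl_hours(40_000, 30):.1f} h")              # prints "333.3 h"
```

Splitting the work across 30 clients shortens the wall-clock time but sends the server the same request rate as one client with a 1-second delay, which is the objection raised above.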
21:41 🔗 brook Ill email him about it again, but he ignored me before
21:41 🔗 brook maybe I got spam filtered
21:44 🔗 brook http://c2.com/cgi/wiki?DownloadWiki no I think he ignores me on purpose
21:45 🔗 schbirid i'll give it a try
21:46 🔗 balrog > The only person who can tell you why it isn't available is its creator, WardCunningham, and he appears unwilling to do so.
21:46 🔗 balrog lol
21:46 🔗 brook he's got a new wiki project on so if it doesn't go well he might do something dodgy with this site to force people onto his new page
21:46 🔗 balrog I think it's unlikely
21:46 🔗 brook I'm not judging him but I've seen other people do this
21:48 🔗 schbirid wget is running
21:48 🔗 balrog schbirid: what delays?
21:49 🔗 balrog I'd also use random wait
21:49 🔗 brook can you pause and resume wget?
21:49 🔗 schbirid 30
21:49 🔗 brook since it has many pages I was worried about that and wrote my own script
21:49 🔗 schbirid you can ctrl-z
21:49 🔗 brook ah ok cool
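A rough sketch of the "random wait" idea, for anyone writing their own script instead of using wget. It assumes the behaviour the wget manual describes for `--random-wait` (the pause varies between 0.5 and 1.5 times the `--wait` value); the `random_wait` helper is a hypothetical name, not part of any tool mentioned here.

```python
import random


def random_wait(base_seconds, rng=random):
    """Pick a delay between 0.5x and 1.5x of base_seconds."""
    return rng.uniform(0.5 * base_seconds, 1.5 * base_seconds)


if __name__ == "__main__":
    # With a 30 s base delay, each pause falls between 15 and 45 seconds.
    for _ in range(3):
        print(f"would sleep {random_wait(30):.1f} s")
```

In a real fetch loop the returned value would be passed to `time.sleep()` between requests; the average delay stays at the base value, so the total crawl time is about the same as with a fixed wait.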
21:49 🔗 chfoo there's this list of pages if you havent seen it yet: http://c2.com/cgi/wiki?search=$
21:50 🔗 brook there is also http://c2.com/cgi/wikiList
21:50 🔗 brook hopefully these two have the same stuff on them
21:50 🔗 balrog "36855 pages found out of 36857 titles searched"
21:50 🔗 schbirid oh nice
21:50 🔗 * schbirid cancels
21:51 🔗 balrog let me see how many lines there are in the second
21:53 🔗 schbirid eww, it has google analytics
21:53 🔗 schbirid i am doing a wget -i on the urls
21:53 🔗 schbirid will forget and find the files in 4 days or so
21:53 🔗 schbirid good night :)
21:53 🔗 schbirid has quit IRC (Leaving)
21:54 🔗 brook you should grep for 'The WikiWiki Server Can not Process Your Request' every so often
21:54 🔗 brook if you see this you need to wait a bit and redownload it
21:54 🔗 balrog brook: does it return an appropriate http response code in that case?
21:55 🔗 brook i don't know
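Since it is unclear whether the server returns a proper HTTP error code, one option is to scan the downloaded files for the error text itself. The sketch below assumes the pages were saved as files under a single directory; the `files_needing_retry` helper and the directory name are hypothetical, and only the marker string comes from the conversation.

```python
from pathlib import Path

# Error text quoted in the channel; a page containing it was not really served.
MARKER = "The WikiWiki Server Can not Process Your Request"


def files_needing_retry(download_dir):
    """Return the saved files whose content contains the error marker."""
    retry = []
    for path in sorted(Path(download_dir).rglob("*")):
        if path.is_file():
            # errors="replace" avoids crashing on pages with odd encodings.
            text = path.read_text(encoding="utf-8", errors="replace")
            if MARKER in text:
                retry.append(path)
    return retry


if __name__ == "__main__":
    # "c2.com" is an assumed directory name for the mirror.
    for path in files_needing_retry("c2.com"):
        print(path)
```

The resulting list could be turned back into URLs and fetched again after waiting a while, as suggested above.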
22:32 🔗 __uu has quit IRC (Ping timeout: 265 seconds)
22:33 🔗 ariscop has joined #archiveteam
22:43 🔗 cadbury__ has quit IRC (Read error: Operation timed out)
22:44 🔗 balrog http://c2.com/cgi/wiki?WikiArchive -- LOL
22:49 🔗 __uu has joined #archiveteam
23:05 🔗 godane SketchCow: all 2006 episodes of the believers voice of victory is uploaded now
23:11 🔗 __uu has quit IRC (Ping timeout: 265 seconds)
23:17 🔗 __uu has joined #archiveteam
23:41 🔗 __uu has quit IRC (Ping timeout: 265 seconds)
23:43 🔗 Nemo_bis Did someone use https://pypi.python.org/pypi/wget ?
23:56 🔗 __uu has joined #archiveteam
