Time |
Nickname |
Message |
00:13
🔗
|
|
xk_id has quit IRC (Read error: Connection reset by peer) |
00:13
🔗
|
|
xk_id_ has joined #archiveteam |
00:16
🔗
|
|
xk_id_ has quit IRC (Remote host closed the connection) |
00:16
🔗
|
|
K4k has quit IRC (Read error: Operation timed out) |
00:18
🔗
|
|
dashcloud has quit IRC (Ping timeout: 272 seconds) |
00:28
🔗
|
|
dashcloud has joined #archiveteam |
01:26
🔗
|
|
arbin has quit IRC (Read error: Connection reset by peer) |
01:28
🔗
|
|
arbin has joined #archiveteam |
01:30
🔗
|
|
__uu has joined #archiveteam |
01:31
🔗
|
|
xk_id has joined #archiveteam |
01:35
🔗
|
|
mistym_ has joined #archiveteam |
01:40
🔗
|
|
__uu has quit IRC (Ping timeout: 265 seconds) |
01:42
🔗
|
|
mistym has quit IRC (Read error: Operation timed out) |
01:56
🔗
|
|
mistym_ has quit IRC (Remote host closed the connection) |
02:03
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
02:09
🔗
|
|
__uu has joined #archiveteam |
02:10
🔗
|
|
dashcloud has joined #archiveteam |
02:11
🔗
|
|
__uu_ has joined #archiveteam |
02:13
🔗
|
|
__uu_ has quit IRC (Client Quit) |
02:24
🔗
|
|
philpem has quit IRC (Ping timeout: 272 seconds) |
02:29
🔗
|
|
primus104 has quit IRC (Leaving.) |
03:40
🔗
|
|
kyan_ has joined #archiveteam |
03:40
🔗
|
|
godane has quit IRC (Ping timeout: 272 seconds) |
03:42
🔗
|
|
kyan has quit IRC (Ping timeout: 258 seconds) |
03:56
🔗
|
|
godane has joined #archiveteam |
03:57
🔗
|
|
mib_0n6by has joined #archiveteam |
03:57
🔗
|
mib_0n6by |
Howdy - if I know of a website that is going down relatively soon, who do I talk to to possibly preserve it? |
03:58
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
03:59
🔗
|
Ctrl-S |
us i guess |
03:59
🔗
|
Ctrl-S |
what website is it? |
04:01
🔗
|
|
Lord_Nigh has joined #archiveteam |
04:01
🔗
|
Atluxity |
talk in channel, that way more people can get involved if need be |
04:02
🔗
|
Atluxity |
also, greetings! and thanks for showing up |
04:04
🔗
|
mib_0n6by |
Sorry |
04:04
🔗
|
mib_0n6by |
kb.berkeley.edu |
04:05
🔗
|
Atluxity |
do you think it requires a lot of storage? |
04:05
🔗
|
mib_0n6by |
Mostly text and few images. |
04:05
🔗
|
Atluxity |
looks like mostly text |
04:05
🔗
|
mib_0n6by |
No large files. |
04:06
🔗
|
Atluxity |
looks like a job for archiveteambot |
04:06
🔗
|
Atluxity |
do you agree Ctrl-S ? |
04:06
🔗
|
Ctrl-S |
I have no idea |
04:06
🔗
|
Atluxity |
ah, ok |
04:07
🔗
|
Ctrl-S |
https://kb.berkeley.edu/page.php?id=23247 |
04:07
🔗
|
mib_0n6by |
Hmm? |
04:07
🔗
|
Ctrl-S |
https://kb.berkeley.edu/page.php?id=23243 |
04:07
🔗
|
Ctrl-S |
might be sequential numbering for articles? |
04:08
🔗
|
mib_0n6by |
It is, but updates to articles are not. |
04:08
🔗
|
mib_0n6by |
And there are a number of subsites. |
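The sequential-ID observation suggests one way to build a crawl list. A minimal sketch in Python, assuming the `page.php?id=N` pattern from the two links posted in the channel holds across the site. The ID range used here is purely hypothetical, and as mib_0n6by points out, article updates and subsites would not be covered by a simple range.

```python
from typing import Iterator

BASE_URL = "https://kb.berkeley.edu/page.php"  # pattern taken from the links in the channel


def candidate_urls(first_id: int, last_id: int) -> Iterator[str]:
    """Yield one candidate article URL per ID in [first_id, last_id]."""
    if first_id > last_id:
        raise ValueError("first_id must not exceed last_id")
    for article_id in range(first_id, last_id + 1):
        yield f"{BASE_URL}?id={article_id}"


if __name__ == "__main__":
    # Hypothetical range: only the two IDs seen in the channel are known to exist.
    for url in candidate_urls(23243, 23247):
        print(url)
```

Many IDs in such a range may not resolve, so the list is best treated as a set of candidates rather than a site map.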
04:08
🔗
|
Atluxity |
I have added the site to a bot used for archiving |
04:09
🔗
|
Ctrl-S |
since it's a university hosting it, might we be able to ask the admins about archiving it locally? |
04:09
🔗
|
Atluxity |
mib_0n6by: do you know when it will go offline? |
04:09
🔗
|
mib_0n6by |
Relatively soon. |
04:09
🔗
|
Ctrl-S |
might be able to mail a HDD? |
04:09
🔗
|
mib_0n6by |
Ctrl-S: it is a UCB page hosted by the University of Wisconsin. |
04:10
🔗
|
mib_0n6by |
Easier to simply grab a copy as the entire site shouldn't be that large. |
04:10
🔗
|
Ctrl-S |
I'm pretty clueless about these matters |
04:10
🔗
|
mib_0n6by |
Trust me when I say that the site is small enough to just grab as opposed to waiting on the University to provide a copy. |
04:11
🔗
|
mib_0n6by |
(which would be a low priority and would likely take longer than just wgetting the whole thing.) |
04:11
🔗
|
Atluxity |
yeah |
04:11
🔗
|
Atluxity |
local copy is often not the best choice |
04:12
🔗
|
mib_0n6by |
Archiving the site sooner is better than not. |
04:12
🔗
|
Atluxity |
I have added the site to an archiving bot |
04:13
🔗
|
mib_0n6by |
Thank you :) |
04:13
🔗
|
Atluxity |
and thank you |
04:13
🔗
|
Ctrl-S |
any other sites/subsites you know of that might be in need of archiving? |
04:14
🔗
|
mib_0n6by |
From the University? |
04:14
🔗
|
Ctrl-S |
anywhere really |
04:14
🔗
|
mib_0n6by |
I contacted Jason Scott a while ago about a private torrent site. |
04:15
🔗
|
mib_0n6by |
berkeley.edu is undergoing a complete site redesign soon, which means everything currently there may no longer be available or completely broken in a few months (this is for the main site only, departmental subsites are a different affair.) |
04:16
🔗
|
Ctrl-S |
I guess that means *.berkeley.edu needs archiving |
04:16
🔗
|
mib_0n6by |
Unknown ETA on the site change... |
04:16
🔗
|
|
Silent700 has left |
04:17
🔗
|
mib_0n6by |
How do you guys handle overlap with the Archive.org WayBackMachine? |
04:17
🔗
|
pikhq_ |
mib_0n6by: Whaddya mean, overlap? When possible the stuff we save gets shoved on there. |
04:18
🔗
|
Ctrl-S |
My understanding is that anything that these guys archive gets shoved onto archive.org if at all possible |
04:19
🔗
|
Atluxity |
aren't we the volunteer guerrilla warriors of archive.org? Acting by ourselves, but hoping we do archive.org's bidding |
04:19
🔗
|
Ctrl-S |
we're just a bit more aggressive/proactive at fetching stuff |
04:19
🔗
|
mib_0n6by |
Overlap... They have their own web spiders for I guess more casual site grabs. Guess you guys pull full sites and if it is a current / recent copy, they wouldn't have the depth nor record of it at that time anyway. |
04:19
🔗
|
mib_0n6by |
Ya... |
04:19
🔗
|
mib_0n6by |
Forget you guys are a rogue branch of bad asses ;) |
04:19
🔗
|
Atluxity |
:D |
04:19
🔗
|
Ctrl-S |
They use outdated systems like robots.txt |
04:20
🔗
|
Atluxity |
robots.txt are made for one thing, to be archived |
04:20
🔗
|
Ctrl-S |
exactly |
04:20
🔗
|
Ctrl-S |
also to point out interesting things |
04:20
🔗
|
pikhq_ |
Eh, robots.txt aren't "outdated". Just completely at odds with archiveteam. |
04:20
🔗
|
pikhq_ |
Though I suppose understandable archive.org listens to them; probably makes their legal standing rather less white-knuckle. |
04:20
🔗
|
Atluxity |
although they sometimes point to redirect loops :\ |
04:20
🔗
|
Ctrl-S |
it was invented when robots could actually overload sites |
04:21
🔗
|
mib_0n6by |
Robots.txt was always a sign in the road and not even a legally binding one at that. |
04:21
🔗
|
Ctrl-S |
or break networks |
04:21
🔗
|
pikhq_ |
Yeah, but having an easy "just opt out" thing probably significantly reduces the random crazies. |
04:21
🔗
|
mib_0n6by |
It doesn't stop you guys :P |
04:21
🔗
|
pikhq_ |
(no accounting for insanity though.) |
04:22
🔗
|
pikhq_ |
mib_0n6by: Yeah, but what're they gonna do, sue a bunch of random folks? |
04:22
🔗
|
Ctrl-S |
in a bunch of random countries |
04:22
🔗
|
pikhq_ |
Who may or may not be identifiable. |
04:22
🔗
|
mib_0n6by |
Does robots.txt have any legal basis? At worst you guys are running a friendly DDOS archive attack. |
04:22
🔗
|
Ctrl-S |
and will probably invoke the Streisand effect if bothered |
04:23
🔗
|
pikhq_ |
You gotta *really* piss off a big company to get that sort of wide-scatter individual lawsuit going. |
04:23
🔗
|
pikhq_ |
mib_0n6by: Not really, though I suspect in a court of law you could at least *argue* that a lack of robots.txt is equal to saying "hey, do whatever you want". |
04:23
🔗
|
Ctrl-S |
it'd probably be cheaper to just give us the drives the data is on than to sue us |
04:24
🔗
|
mib_0n6by |
When was any company actually reasonable? |
04:24
🔗
|
Ctrl-S |
never, but they like money a whole lot |
04:24
🔗
|
pikhq_ |
Now, I suppose there's a chance that Yahoo! does that the next time they bring down a service. |
04:24
🔗
|
mib_0n6by |
They don't care about things such as cultural heritage, memory and understanding history though. |
04:25
🔗
|
Ctrl-S |
Bad PR&have to pay lawyers |
04:25
🔗
|
mib_0n6by |
Ya... Yahoo! is still working through the bad press from shutting down geocities /sarcasm. |
04:26
🔗
|
Ctrl-S |
>Have to pay lawyers. >PAY |
04:27
🔗
|
mib_0n6by |
That assumes that corporations are a thinking beast that have morals, values and cares. |
04:28
🔗
|
mib_0n6by |
Much less ones that align with you. |
04:28
🔗
|
Ctrl-S |
they care about getting more money |
04:28
🔗
|
mib_0n6by |
Which preserving a cultural heritage obviously allows them to collect. |
04:29
🔗
|
Ctrl-S |
i mean there is a financial downside to lawsuits |
04:29
🔗
|
Ctrl-S |
they don't give one shit about culture |
04:30
🔗
|
|
mib_0n6by has left |
04:33
🔗
|
|
kyan_ is now known as kyan |
04:36
🔗
|
balrog |
!a http://www.reddit.com/r/frc/ --phantomjs |
04:36
🔗
|
balrog |
oops |
04:41
🔗
|
yipdw |
I was going to say that archiveteam projects can be construed in the US as a violation of the CFAA if a website's ToS has anti-DoS provisions |
04:41
🔗
|
yipdw |
but the CFAA is so broad, fuck it |
04:42
🔗
|
yipdw |
I'm sure there's a way you can construe that law so that you can get arrested for typing |
04:47
🔗
|
Ctrl-S |
I believe we'd probably not be worth suing, and the EFF would be all over the case |
04:48
🔗
|
Ctrl-S |
Police would consider it not worth their time, since we are always careful to not overload the site |
04:49
🔗
|
Atluxity |
police? do they get involved in a lawsuit? |
04:49
🔗
|
Atluxity |
or maybe you were thinking of two different scenarios |
04:50
🔗
|
Ctrl-S |
yes |
04:50
🔗
|
Ctrl-S |
either a lawsuit or contacting the feds over that law |
04:51
🔗
|
Atluxity |
I actually have access to a pretty good legal fund and a great lawyer if I was to be targeted... but doubt it very much |
04:58
🔗
|
yipdw |
I usually bring up the lawsuit line in a "psh who cares" fashion |
04:58
🔗
|
yipdw |
it's roughly on the same level of concern as jaywalking, and far less dangerous |
04:58
🔗
|
Ctrl-S |
p. much |
04:59
🔗
|
yipdw |
between getting hit with Stephen Heymann or getting hit with a car I'll take Heymann |
04:59
🔗
|
yipdw |
at least you can damage Heymann |
04:59
🔗
|
yipdw |
oh right I have +o |
05:00
🔗
|
yipdw |
woop woop woop off topic siren |
05:10
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:26
🔗
|
|
StartAway is now known as Start |
05:29
🔗
|
|
Start is now known as StartAway |
06:07
🔗
|
|
mistym has joined #archiveteam |
06:34
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
06:34
🔗
|
|
dashcloud has joined #archiveteam |
07:12
🔗
|
SketchCow |
YEAH |
07:13
🔗
|
SketchCow |
My MS-DOS thing has finished |
07:13
🔗
|
SketchCow |
All the booting verified, and the script that hit the Mobygames site now does a great job |
07:30
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
07:34
🔗
|
|
brayden_ has joined #archiveteam |
07:37
🔗
|
|
lytv has quit IRC (Read error: Operation timed out) |
07:38
🔗
|
|
lytv has joined #archiveteam |
07:39
🔗
|
|
dashcloud has joined #archiveteam |
07:40
🔗
|
|
brayden has quit IRC (Read error: Operation timed out) |
07:42
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
07:45
🔗
|
|
dashcloud has joined #archiveteam |
08:26
🔗
|
|
primus104 has joined #archiveteam |
08:39
🔗
|
|
philpem has joined #archiveteam |
08:40
🔗
|
|
kris33 has joined #archiveteam |
09:16
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:19
🔗
|
|
dashcloud has joined #archiveteam |
09:47
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
10:16
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
10:24
🔗
|
|
kris33 has quit IRC (Textual IRC Client: www.textualapp.com) |
10:27
🔗
|
|
brayden_ has quit IRC (Ping timeout: 606 seconds) |
10:37
🔗
|
|
Swizzle_ has joined #archiveteam |
10:41
🔗
|
|
schbirid has joined #archiveteam |
10:44
🔗
|
|
Swizzle has quit IRC (Read error: Operation timed out) |
10:55
🔗
|
|
Control-S has joined #archiveteam |
11:03
🔗
|
|
Ctrl-S has quit IRC (Read error: Operation timed out) |
11:03
🔗
|
|
Control-S is now known as Ctrl-S |
12:03
🔗
|
|
Ymgve has joined #archiveteam |
12:31
🔗
|
|
brayden has joined #archiveteam |
13:04
🔗
|
|
lbft_ has quit IRC (Ping timeout: 258 seconds) |
13:21
🔗
|
|
lbft has joined #archiveteam |
13:56
🔗
|
|
bauruine has quit IRC (Ping timeout: 265 seconds) |
14:01
🔗
|
|
bauruine has joined #archiveteam |
14:56
🔗
|
|
primus105 has joined #archiveteam |
15:02
🔗
|
|
primus104 has quit IRC (Read error: Operation timed out) |
15:12
🔗
|
|
archvtyp1 has joined #archiveteam |
15:13
🔗
|
|
archvtype has quit IRC (Read error: Operation timed out) |
15:33
🔗
|
|
BiggieJon has joined #archiveteam |
15:37
🔗
|
|
BiggieJo1 has quit IRC (Read error: Operation timed out) |
15:41
🔗
|
|
ohhdemgir has quit IRC (Leaving) |
16:17
🔗
|
|
toad1 has joined #archiveteam |
16:24
🔗
|
|
toad2 has quit IRC (Ping timeout: 600 seconds) |
17:50
🔗
|
|
robv has joined #archiveteam |
18:05
🔗
|
StartAway |
http://vstreamers.com |
18:05
🔗
|
StartAway |
"Website will be shutting down day January 15th." |
18:06
🔗
|
StartAway |
the site looks to be a clone of old youtube |
18:06
🔗
|
arkiver |
looks like they have less than 6000 videos |
18:09
🔗
|
StartAway |
i'll get to work on the site structure |
18:10
🔗
|
StartAway |
got any ideas for an irc channel name? |
18:10
🔗
|
arkiver |
StartAway: ok, I'll start with the scripts for vstreamer |
18:11
🔗
|
|
StartAway is now known as Start |
18:11
🔗
|
midas |
10x409 pages arkiver |
18:11
🔗
|
arkiver |
Yes |
18:11
🔗
|
midas |
rather small |
18:11
🔗
|
arkiver |
21 channel pages |
18:11
🔗
|
arkiver |
midas: yeah, less than 6000 videos |
18:11
🔗
|
midas |
maybe we can run it through the bot? |
18:12
🔗
|
arkiver |
those videos are not linked to from the html |
18:13
🔗
|
arkiver |
probably some post somewhere (haven't checked yet) |
18:14
🔗
|
midas |
oh well, it should be easy to grab |
18:14
🔗
|
midas |
(size wise that is) |
18:14
🔗
|
arkiver |
yeah |
18:14
🔗
|
arkiver |
I already found the videos |
18:14
🔗
|
arkiver |
should be doable |
18:17
🔗
|
|
intothemo has joined #archiveteam |
18:17
🔗
|
|
intothemo has quit IRC (Client Quit) |
18:20
🔗
|
Start |
would #destreamers be a good name for the irc channel? |
18:24
🔗
|
arkiver |
that would do I think |
18:27
🔗
|
Start |
ok |
18:40
🔗
|
|
nertzy has joined #archiveteam |
18:52
🔗
|
|
nertzy has quit IRC (This computer has gone to sleep) |
19:00
🔗
|
|
aaaaaaaaa has joined #archiveteam |
19:17
🔗
|
|
BlueMaxim has joined #archiveteam |
19:27
🔗
|
|
mistym has joined #archiveteam |
19:33
🔗
|
Start |
with vstreamers shutting down, i'd place zippcast on a watchlist |
19:34
🔗
|
Start |
zippcast has shut down multiple times in the past and reappeared without any content that was previously there |
19:35
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
19:59
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
20:13
🔗
|
|
dashcloud has joined #archiveteam |
20:56
🔗
|
|
dashcloud has quit IRC (Read error: Connection reset by peer) |
21:01
🔗
|
|
signius has quit IRC (Ping timeout: 258 seconds) |
21:05
🔗
|
|
dashcloud has joined #archiveteam |
21:14
🔗
|
brook |
Hi |
21:14
🔗
|
|
signius has joined #archiveteam |
21:15
🔗
|
brook |
can anyone help me out? I want to archive this wiki http://c2.com/cgi/wiki?PrinciplesObjectivesAndGoals |
21:16
🔗
|
brook |
could I get +v to try the bot on it? |
21:19
🔗
|
brook |
anyone have some input, suggestions? |
21:26
🔗
|
chfoo |
you can get an idea of how many links are in the wayback machine by using this link: http://web.archive.org/web/*/http://c2.com/* and there's an index of archivebot's crawls of c2.com: http://archive.fart.website/archivebot/viewer/job/xdufx |
21:28
🔗
|
chfoo |
and you can search the chat logs at http://archive.fart.website/bin/irclogger_logs to see why it was aborted |
21:29
🔗
|
|
ariscop has quit IRC (Ping timeout: 492 seconds) |
21:29
🔗
|
brook |
it looks like the log is password protected |
21:30
🔗
|
brook |
I'm not too interested in why it stopped the archive anyway |
21:30
🔗
|
brook |
I want to make an offline image/mirror of the site |
21:30
🔗
|
brook |
archive.org says it has 117,838 urls |
21:31
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
21:34
🔗
|
chfoo |
oh, if you want a personal archive, you can try setting up and customizing archivebot for yourself, grab it with wget/wpull/httrack/heritrix, or ask someone else to do it |
21:34
🔗
|
|
dashcloud has joined #archiveteam |
21:36
🔗
|
brook |
there's 35k pages and it wants a delay time of 30 seconds per get. So if I got 30 people to help me we could do this in 10 hours |
21:36
🔗
|
schbirid |
that defeats the purpose of the 30s wait |
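The estimate can be checked with a little arithmetic. A minimal sketch in Python using the figures from the channel (35k pages, 30 seconds per request); the function name is illustrative. One client needs about 292 hours (roughly 12 days), and 30 clients each waiting 30 seconds need about 9.7 hours, which matches the "10 hours" figure.

```python
def crawl_hours(pages: int, delay_seconds: float, workers: int = 1) -> float:
    """Hours needed to fetch `pages` pages with a fixed delay between requests,
    split evenly across `workers` independent clients."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return pages * delay_seconds / workers / 3600


if __name__ == "__main__":
    single = crawl_hours(35_000, 30)
    shared = crawl_hours(35_000, 30, workers=30)
    print(f"one client: {single:.1f} hours ({single / 24:.1f} days)")
    print(f"30 clients: {shared:.1f} hours")
```

With 30 clients each pausing 30 seconds, the server sees about one request per second in aggregate, which is the objection raised in the reply.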
21:36
🔗
|
brook |
I tried on my own but the delay time was too low and it stopped giving me the pages after a bit |
21:37
🔗
|
schbirid |
http://c2.com/cgi/wiki?search=* says ~40k pages |
21:38
🔗
|
brook |
ah so there's a lot of pages! |
21:41
🔗
|
brook |
I'll email him about it again, but he ignored me before |
21:41
🔗
|
brook |
maybe I got spam filtered |
21:44
🔗
|
brook |
http://c2.com/cgi/wiki?DownloadWiki no I think he ignores me on purpose |
21:45
🔗
|
schbirid |
i'll give it a try |
21:46
🔗
|
balrog |
> The only person who can tell you why it isn't available is its creator, WardCunningham, and he appears unwilling to do so. |
21:46
🔗
|
balrog |
lol |
21:46
🔗
|
brook |
he's got a new wiki project on so if it doesn't go well he might do something dodgy with this site to force people onto his new page |
21:46
🔗
|
balrog |
I think it's unlikely |
21:46
🔗
|
brook |
I'm not judging him but I've seen other people do this |
21:48
🔗
|
schbirid |
wget is running |
21:48
🔗
|
balrog |
schbirid: what delays? |
21:49
🔗
|
balrog |
I'd also use random wait |
21:49
🔗
|
brook |
can you pause and resume wget? |
21:49
🔗
|
schbirid |
30 |
21:49
🔗
|
brook |
since it has many pages I was worried about that and wrote my own script |
21:49
🔗
|
schbirid |
you can ctrl-z |
21:49
🔗
|
brook |
ah ok cool |
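For reference, wget's `--random-wait` option, used together with `--wait=30`, varies each pause between 0.5 and 1.5 times the base wait. For anyone writing their own script instead, a minimal sketch of the same idea in Python; the function name and the seeded generator are illustrative and not part of any existing tool.

```python
import random
from typing import Iterator, Optional


def randomized_waits(base_wait: float, count: int,
                     rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield `count` pauses, each between 0.5x and 1.5x of `base_wait` seconds."""
    if rng is None:
        rng = random.Random()
    for _ in range(count):
        yield base_wait * rng.uniform(0.5, 1.5)


if __name__ == "__main__":
    # A real crawler would call time.sleep(pause) between requests.
    for pause in randomized_waits(30, 5, random.Random(0)):
        print(f"{pause:.1f} s")
```

The average pause stays at the base wait, so the total crawl time is about the same as with a fixed 30-second delay.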
21:49
🔗
|
chfoo |
there's this list of pages if you havent seen it yet: http://c2.com/cgi/wiki?search=$ |
21:50
🔗
|
brook |
there is also http://c2.com/cgi/wikiList |
21:50
🔗
|
brook |
hopefully these two have the same stuff on them |
21:50
🔗
|
balrog |
"36855 pages found out of 36857 titles searched" |
21:50
🔗
|
schbirid |
oh nice |
21:50
🔗
|
* |
schbirid cancels |
21:51
🔗
|
balrog |
let me see how many lines there are in the second |
21:53
🔗
|
schbirid |
eww, it has google analytics |
21:53
🔗
|
schbirid |
i am doing a wget -i on the urls |
21:53
🔗
|
schbirid |
will forget and find the files in 4 days or so |
21:53
🔗
|
schbirid |
good night :) |
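The `wget -i` approach needs a file of URLs. A minimal sketch in Python of building one, assuming the page titles have already been extracted from one of the index pages into a plain list, one title per item (the extraction step is not shown, since the index markup is not described in the channel). The `http://c2.com/cgi/wiki?PageName` pattern comes from the links posted here.

```python
from typing import Iterable, List

WIKI_BASE = "http://c2.com/cgi/wiki?"  # URL pattern seen in the links in the channel


def titles_to_urls(titles: Iterable[str]) -> List[str]:
    """Turn page titles into wiki URLs, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for raw in titles:
        title = raw.strip()
        if title and title not in seen:
            seen.add(title)
            urls.append(WIKI_BASE + title)
    return urls


if __name__ == "__main__":
    # Sample titles are page names mentioned in the channel.
    sample = ["DownloadWiki", "WikiArchive", "", "DownloadWiki"]
    print("\n".join(titles_to_urls(sample)))
```

Saving the output to a file (say `urls.txt`, a hypothetical name) allows a run such as `wget -i urls.txt --wait=30 --random-wait`.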
21:53
🔗
|
|
schbirid has quit IRC (Leaving) |
21:54
🔗
|
brook |
you should grep for 'The WikiWiki Server Can not Process Your Request' every so often |
21:54
🔗
|
brook |
if you see this you need to wait a bit and redownload it |
21:54
🔗
|
balrog |
brook: does it return an appropriate http response code in that case? |
21:55
🔗
|
brook |
i don't know |
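Since it is unclear whether the server signals throttling with an HTTP status code, checking the page bodies is the safer route. A minimal sketch in Python of the periodic check brook describes, assuming the downloaded pages sit in one directory tree; the directory name is hypothetical, and the marker string is the one quoted in the channel.

```python
from pathlib import Path
from typing import List

ERROR_MARKER = "The WikiWiki Server Can not Process Your Request"  # string quoted by brook


def pages_to_refetch(download_dir: str) -> List[Path]:
    """Return downloaded files whose body contains the throttling message."""
    flagged = []
    for path in sorted(Path(download_dir).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if ERROR_MARKER in text:
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    for path in pages_to_refetch("c2.com"):  # hypothetical wget output directory
        print(path)
```

Files reported this way would need to be fetched again after a pause, as brook suggests.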
22:32
🔗
|
|
__uu has quit IRC (Ping timeout: 265 seconds) |
22:33
🔗
|
|
ariscop has joined #archiveteam |
22:43
🔗
|
|
cadbury__ has quit IRC (Read error: Operation timed out) |
22:44
🔗
|
balrog |
http://c2.com/cgi/wiki?WikiArchive -- LOL |
22:49
🔗
|
|
__uu has joined #archiveteam |
23:05
🔗
|
godane |
SketchCow: all 2006 episodes of the believers voice of victory is uploaded now |
23:11
🔗
|
|
__uu has quit IRC (Ping timeout: 265 seconds) |
23:17
🔗
|
|
__uu has joined #archiveteam |
23:41
🔗
|
|
__uu has quit IRC (Ping timeout: 265 seconds) |
23:43
🔗
|
Nemo_bis |
Did someone use https://pypi.python.org/pypi/wget ? |
23:56
🔗
|
|
__uu has joined #archiveteam |