Time | Nickname | Message |
04:30 | exmic | SketchCow: yeah, ftp. |
04:30 | exmic | I'm in bumfuck utah this week |
09:32 | tephra | SketchCow: stupid question but would like to make sure, do we have a grab of: https://www.aclu.org/nsa-documents-search ? |
10:06 | godane | SketchCow: i see that you're sorting through my stuff |
10:53 | etesp | and with 3790 pages of threads in just that subforum, that's going to slow things down a lot. |
10:55 | etesp | is it possible to change the omits on a running archive project? |
11:01 | midas | you mean in archivebot? |
12:35 | balrog | ftp.netscape.com |
12:35 | balrog | probably should be archived |
12:52 | midas | ok balrog |
12:52 | midas | grabbing it now |
13:07 | ohhdemgir | http://www.irishtimes.com/culture/books/crowds-retrieve-100-000-books-dumped-in-skip-1.1827142 |
13:07 | ohhdemgir | midas, got it |
13:07 | ohhdemgir | https://archive.org/details/ftp.netscape.com |
13:11 | midas | lol, cancelling mine |
13:16 | balrog | sorry ;) |
13:31 | JohnnyJac | Hey, everybody. Saw a Defcon presentation for this project, and I absolutely love the work being done. Just moved into a new place, so my workshop isn't set up yet, but I think I may get in and throw in my efforts as well. Awesome work. |
13:32 | joepie91 | JohnnyJac: welcome :) |
13:33 | joepie91 | JohnnyJac: quick note; off-topic conversations and lengthy discussions generally take place in #archiveteam-bs, so as to not clog up this channel... it's not unusual for people to come in, shout that something's going down, and leave again |
13:33 | joepie91 | and keeping off-topic separate makes it easier to keep track of that |
13:35 | JohnnyJac | Noted, and changing IRC client accordingly. |
13:38 | joepie91 | :) |
14:20 | SketchCow | ftp.netscape.com has been grabbed three times. |
14:43 | asie | only three? |
14:43 | asie | i grabbed it once too |
14:48 | etesp | 18:01:30 midas you mean in archivebot? |
14:48 | etesp | yes |
14:49 | midas | etesp: #archivebot |
15:18 | DFJustin | I just see two, with the other one being some different old incarnation https://archive.org/search.php?query=%22ftp.netscape.com%22 |
15:19 | SketchCow | I mean, keep grabbing it, sure. |
17:09 | etesp | did http://archivebot.at.ninjawedding.org:4567/#/histories/http://forums.spacebattles.com/forums/vs-debates.4/ get all the thread contents, or just the pages of threads? |
17:09 | etesp | Looking at the other spacebattles archiving, i'm not seeing any topics being saved |
17:10 | etesp | threads have urls like http://forums.spacebattles.com/threads/the-hundred-lives-of-the-dragon.301712/ |
17:10 | DFJustin | the threads are in a different folder so no |
17:10 | etesp | ah |
17:11 | etesp | could it be set to include links to threads? |
17:11 | etesp | or too large a forum for that? |
17:14 | etesp | they're mostly sequential, the text between /threads/ and .number/ is irrelevant, it redirects you to the right place |
17:14 | etesp | if that helps |
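If the slug between /threads/ and the numeric ID really is ignored and the server redirects to the canonical thread, a seed list could in principle be built from the IDs alone. The following is only a minimal sketch of that idea: the placeholder slug `x`, the function name `thread_urls`, and the ID range in the usage lines are hypothetical, and it assumes IDs are plain consecutive integers (etesp only says "mostly sequential", so some IDs may not resolve).

```python
from typing import Iterator

# Base path taken from the thread URL quoted in the log.
BASE = "http://forums.spacebattles.com/threads/"


def thread_urls(first_id: int, last_id: int, slug: str = "x") -> Iterator[str]:
    """Yield one candidate thread URL per ID in [first_id, last_id].

    The slug is a hypothetical placeholder; the log suggests the server
    ignores it and redirects based on the numeric ID.
    """
    for thread_id in range(first_id, last_id + 1):
        yield f"{BASE}{slug}.{thread_id}/"


if __name__ == "__main__":
    # Hypothetical ID range, chosen only to end at the thread quoted above.
    for url in thread_urls(301710, 301712):
        print(url)
```

Whether a list like this is practical depends on how many IDs exist and on how the crawler accepts seed URLs, neither of which the log settles.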
17:40 | ohhdemgir | SketchCow, asie https://archive.org/details/ftp_netscape_com_2013_04 says "Captured on April 2013. Contains the FTP.NETSCAPE.COM site which shut down in 2005." and is 2.4GB, the last I did was https://archive.org/details/ftp.netscape.com and is 18GB larger.. heh, still up, we do need to start checking though I guess |
17:40 | asie | oh, that one |
17:40 | ohhdemgir | midas, did you get back into your box? |
18:09 | midas | ohhdemgir: yep |
18:10 | midas | and then i cancelled them all |
18:10 | ohhdemgir | wut wut |
18:16 | midas | with the amount of money i save i can get a bigger and faster box with a different ISP |
18:38 | SketchCow | Hmmm, it appears to be taking Internet Archive a while to transfer this tiny 683gb file to their servers |
18:40 | midas | just 683GB? |
18:41 | SketchCow | Yeah, a pittance |
18:41 | midas | rubbish service I say, apple provided me with 10.000 petabyte on a mobile phone. |
18:43 | midas | hey SketchCow, what about a tour of IA streamed on justin.tv |
18:43 | SketchCow | ha ha |
18:44 | yipdw | make sure to turn archiving off |
18:44 | SketchCow | I think that's the default |
18:44 | SketchCow | Everywhere |
18:44 | midas | but seriously, i'm up for a tour, i'm just about 10 hours away.. would be cool if there was an internet tour of some sort |
20:45 | balrog | does anyone know if Cameron Kaiser (of tenfourfox/classilla) is on twitter? Classilla throws an SSL error on https://archive.org |
22:49 | dashcloud | so, if I want wget to grab every page/subdomain on a site, but never go to any external domains, what commands do I need? |
23:05 | dashcloud | is the section "Creating a WARC with Wget" here: http://www.archiveteam.org/index.php?title=Wget as close to a canonical/recommended Wget WARC command as exists currently? |
23:13 | nico | dashcloud: https://github.com/ArchiveTeam/ArchiveBot/commit/4f7e460add8a3d56debf7062674cece81fd6818e#diff-5016cb0693c5048fba29437c7301f0a8L372 |
23:13 | nico | That's what Archivebot used when it was still wget-based |
23:17 | dashcloud | so, I understand most of that, but why are output-document and truncate-output used? |
23:24 | yipdw | dashcloud: output-document and truncate-output were used to avoid generating temporary files |
23:24 | yipdw | by default, wget in recursive mode will mirror (as best as possible) the URL structure in the directory structure |
23:24 | yipdw | we don't really care about that when generating a WARC |
23:24 | dashcloud | so I do in fact want those if I'm planning to do a large grab then |
23:24 | yipdw | not if you're writing a WARC |
23:25 | yipdw | or you can afford the 2x disk space requirement |
23:25 | dashcloud | I'll leave them out then |
23:25 | yipdw | keep in mind too that recursive grab is problematic on bizarro directory names |
23:25 | yipdw | or really fucking huge URLs |
23:25 | yipdw | writing as WARC records sidesteps these problems |
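For dashcloud's question about staying on one site while still following its subdomains, one plausible shape of the command is sketched below as a small Python helper that assembles the argument list. This is not the ArchiveBot command nico linked. The helper name `build_wget_warc_command`, the `example.com` values, and the `wget.tmp` file name are hypothetical. The sketch assumes that `--span-hosts` combined with `--domains` limits recursion to hosts ending in the given domain, and that `--truncate-output` is only available in ArchiveTeam's patched wget build, so it is off by default.

```python
import shlex
from typing import List


def build_wget_warc_command(seed_url: str, domain: str, warc_name: str,
                            patched_wget: bool = False) -> List[str]:
    """Assemble a wget argument list for a recursive, same-site WARC grab."""
    cmd = [
        "wget",
        "--recursive", "--level=inf",   # follow links without a depth limit
        "--page-requisites",            # also fetch images, CSS, scripts
        "--span-hosts",                 # allow leaving the seed host...
        f"--domains={domain}",          # ...but only to hosts under this domain
        f"--warc-file={warc_name}",     # write WARC records alongside the grab
    ]
    if patched_wget:
        # Assumption: --truncate-output exists only in the patched build.
        # Together with --output-document it avoids keeping a mirrored tree.
        cmd += ["--output-document=wget.tmp", "--truncate-output"]
    cmd.append(seed_url)
    return cmd


if __name__ == "__main__":
    args = build_wget_warc_command("http://example.com/", "example.com",
                                   "example.com-grab")
    print(shlex.join(args))
```

Building the list in code rather than a shell string keeps quoting out of the picture; the printed line is only for inspection before running it by hand.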
23:39 | dashcloud | is there something special I need to know about inserting a warc header for wget? |
23:44 | dashcloud | taking the recommendations from all of the pages, here's the command I'm using: http://paste.archivingyoursh.it/kulekemini.mel ; wget complains about the second warc header, saying it is invalid. does every warc header need to be like the first warc header? |
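The paste itself is not in the log, so the actual cause of the error is unknown. One thing worth checking, assuming wget expects each `--warc-header` value in `name: value` form (a non-empty name with no whitespace, then a colon, and no line breaks), is whether the second header follows that shape. The helper below is a rough approximation of such a check, not wget's own code; the function name and the sample headers are hypothetical.

```python
def looks_like_valid_warc_header(header: str) -> bool:
    """Approximate check that a header string has the form 'name: value'."""
    # A header passed on the command line should be a single line.
    # (Rejecting "\r" as well is an extra precaution of this sketch.)
    if "\n" in header or "\r" in header:
        return False
    name, colon, _value = header.partition(":")
    # There must be a colon, and something before it.
    if not colon or not name:
        return False
    # The name part must not contain spaces or tabs.
    return not any(ch.isspace() for ch in name)


if __name__ == "__main__":
    for candidate in ("operator: Archive Team", "operator Archive Team"):
        print(repr(candidate), looks_like_valid_warc_header(candidate))
```

If the second header passes a check like this, the problem is more likely shell quoting, for example a value with spaces that was not quoted and was split into separate arguments.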