[00:01] Just saying, I've got one in mind that seems to be in that catatonic state sites go into for a long time before they lurch over and die.
[00:02] *double over
[00:04] namespace: there are a few kinds of template wget usages
[00:05] namespace: http://www.archiveteam.org/index.php?title=User:Djsmiley2k
[00:05] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[00:05] arrith1: I know.
[00:06] wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file misc-yero-org http://misc.yero.org/modulez/
[00:06] namespace: oh ok
[00:06] I'll definitely look at those when I set up my grab.
[00:06] namespace: looks easy enough with recursion. just point it at some base-ish url and it basically handles itself, it looks like
[01:07] here's some more, including some better suited to multiple grabs: http://pad.archivingyoursh.it/p/wget-warc
[01:07] I did use one of the bottom two for the nwnet.co.uk grab
[01:08] I didn't come up with any of them - just wrote them down so I would know what I did for next time
[02:40] i'm uploading the BBC Test Pilot series
[03:24] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.1
[03:24] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.2
[03:24] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.3
[03:24] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.4
[03:25] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.5
[03:25] uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.6
[03:25] so now you have that bbc series
[03:26] if you add vhs to the keywords it should show up in https://archive.org/details/digitizedfromvhs
[03:32] i added vhs to the keywords
[03:39] Question: Should I bother following a crawl delay if it's measured in minutes?
[03:42] namespace: if it's not time-sensitive (as in the site isn't going away in hours/days) i'd say yeah
[03:55] "If you do not understand the difference between these notations, or do not know which one to use, just use the plain ordinary format you use with your favorite browser, like Lynx or Netscape." Yeah GNU, Lynx is my go-to browser.
[03:55] (Lets you know when this was written, doesn't it?)
[05:52] Back
[06:02] hey
[06:04] hi
[06:04] someone should archive this: http://old-blog.boxee.tv/
[06:09] What does the --warc-file option do?
[06:09] It's not mentioned in the manual.
[06:10] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output describes it
[06:11] ...
[06:11] Damn, I've got version 1.13
[06:12] Should I go out and install a newer version or?
[06:13] yes
[06:13] http://lifehacker.com/log-in-to-your-yahoo-mail-address-or-lose-it-on-july-1-514371670 this is interesting
[06:14] Smiley: Where do I get it? Duckduckgo is not forthcoming.
[06:15] i think there's a version on github
[06:15] * Smiley checks
[06:16] ah it's down
[06:16] I think the archiveteam github repo has a copy, namespace
[06:16] --save-headers is not an acceptable substitute for a small site grab?
[06:17] Also, http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
[06:17] I'm not compiling it. -_-
[06:17] ...
[06:17] why not?
[06:17] I mean I could, but it'd be tedious.
[06:17] if you don't have it already your distro obviously sucks.
[06:17] ./configure && make is tedious?
[06:18] Smiley: No no, managing binaries without a package manager is tedious.
[06:18] well your package manager isn't managing to provide it
[06:18] * Smiley has gone away for his wife's birthday.
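A quick way to check whether an installed wget already has the WARC options discussed above, before going hunting for a 1.14 build; this is only a heuristic, assuming the WARC flags show up in the built-in help as they do from wget 1.14 onward:

    # print the installed version, then look for WARC options in the help text
    wget --version | head -n 1
    wget --help | grep -i warc || echo "no WARC options here, need wget 1.14 or newer"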
[06:18] if it's not warc it can't be easily imported into the wayback machine, which makes the data a lot less accessible
[06:18] Hmm.
[06:19] And I'm afraid that I won't compile it with the SSL libraries or something.
[06:19] There's nothing worse than dependency hell.
[06:23] Okay, fine, I'll compile it.
[06:23] i compiled wget 1.14 recently on debian. just had to use the openssl configure option
[06:23] i'm uploading the millennium celbrations on bbc1
[06:23] debian 6 that is
[06:24] *celebrations
[06:25] sudo apt-get install -y build-essential openssl libssl-dev
[06:25] Thank you, I was just about to search for what the compiling packages were called again.
[06:26] don't *install* from the make step
[06:26] simply put it all elsewhere and call it directly
[06:26] yeah, i just make and then run it in place
[06:26] symlink into PATH, or just alias, or full path. i like full path myself
[06:27] yeah, full path
[06:27] I have /home/$user/bin/tools/random_stuff_i've_compiled_here
[06:27] I know how to use a binary, I've had to do it more than once, which is of course why I hate doing it..
[06:27] namespace: ./configure --with-ssl=openssl
[06:27] Is that the only compile option to worry about?
[06:27] I read the manual and it didn't mention any others.
[06:28] namespace: it worked just like that for me on debian 6
[06:28] K.
[06:35] Okay, got it.
[06:39] good
[06:41] I guess my linux-fu has improved significantly since the last time I tried to do this.
[06:41] Because I remember binaries being nothing but trouble.
[06:42] maybe it depends on the binary. one thing is to not do 'make install', because removing it can be tricky
[06:59] And I've started, wish me luck.
[07:19] gl
[07:20] My biggest fear is actually being too aggressive with the grab.
[07:21] I bandwidth-limited it to dial-up speeds and did -w 10, and I *think* I'm respecting robots.txt, which I guess is enough.
[07:22] namespace: yeah, if it's not time-sensitive you can scale back. also, to be nice, use an honest user agent with "ArchiveTeam" in it somewhere
[07:22] I guess it's not time-sensitive, but I'm not sure that wget will resume the download correctly, and how far would you scale back?
[07:23] (And I used the user agent on Smiley's user profile.)
[07:23] hm, i think your current options are okay
[07:24] i mean scale back from the defaults, which have basically no wait
[07:24] Yeah, Smiley's script is much more aggressive than mine.
[07:24] A quarter of a second wait time + explicitly ignores robots.txt
[07:29] after watching people fumble through building wget yet again I have added basic instructions to the wiki http://archiveteam.org/index.php?title=Wget
[07:30] omf_: "fumble" probably isn't the right word. It's not like it took me a very long time to do it.
[07:31] The only reason it took as long as it did is that I hadn't compiled anything on this install yet.
[07:31] (And thus had to grab build-essential/libssl/etc)
[07:31] dude, you complained the whole time: "I'm not compiling it. -_-"
[07:32] omf_: Yeah, that's because I hate compiling binaries for miscellaneous utilities.
[07:32] talk about a first world problem
[07:32] It's not the compiling that sucks, it's the management.
[07:32] omf_: ./configure --with-ssl=openssl
[07:32] omf_: on debian 6 i needed: sudo apt-get install -y build-essential openssl libssl-dev
[07:32] Do it more than a few times and finding your shite gets annoying.
[07:33] omf_: And since you're probably not getting that tone out of my voice right now: Thank you.
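Pulling the build steps from this exchange together, the sequence that reportedly worked on Debian 6 amounts to roughly the following; the tarball URL is the one linked earlier, and skipping 'make install' in favour of running the binary in place follows the advice given above:

    # build dependencies, as suggested above (Debian/Ubuntu package names)
    sudo apt-get install -y build-essential openssl libssl-dev
    # fetch and unpack the 1.14 source linked earlier
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xzf wget-1.14.tar.gz
    cd wget-1.14
    # build with OpenSSL support; no 'make install', run it in place instead
    ./configure --with-ssl=openssl
    make
    ./src/wget --version   # call it by full path, symlink it into PATH, or alias it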
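And going back to the politeness discussion around 07:20-07:24, a gentler variant of the template command from earlier might look something like this; the wait, rate limit, user-agent string, WARC file name and URL are all placeholders to adapt, and robots.txt stays honored unless you add -e robots=off as the template above does:

    # slow, honest recursive WARC grab (placeholder values throughout)
    wget -r -l 0 -m -p \
        --wait 10 --random-wait --limit-rate=50k \
        --user-agent "Wget (ArchiveTeam; contact: your-nick)" \
        --warc-header "operator: Archive Team" --warc-cdx \
        --warc-file example-site \
        http://example.com/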
[07:34] arrith1: The instructions are distro-independent. No assumptions are made about which distribution you use. Not everyone uses Debian.
[07:36] omf_: One of the great annoyances of giving instructions is that distro-independent instructions are useless: if you know what to change then you don't need the instructions, and if you don't know what to change then the instructions will only frustrate you.
[07:36] Then again, the SSL support isn't strictly necessary.
[07:36] if you are grabbing an https site it is
[07:37] which happens a lot
[07:37] both greader-grab and greader-directory-grab could use a few more downloaders, in case anyone is up for it
[07:37] 11 days left
[07:37] ivan`: is directory-grab particularly bandwidth-intensive?
[07:38] omf_: The keyword there is "strictly". It's probably better to just tell people to grab libssl-dev and to add the configure flag, which is distro-independent.
[07:39] winr4r: no, very low bandwidth
[07:39] (Well, maybe not libssl-dev specifically, the package could be named something else.)
[07:39] ivan`: fuck yeah i'm on it
[07:39] thanks
[07:39] first off, you do not need the flag if the dependency is properly installed. It is an autotools config
[07:39] omf_: Ah.
[07:40] which is signified by the fact that you run ./configure to set up the build
[07:43] also
[07:44] someone with commit access should probably fix the README.md for greader-directory-grab because the instructions don't work
[07:45] # Start downloading with:
[07:45] screen ~/.local/bin/run-pipeline --disable-web-server pipeline.py YOURNICKNAME
[07:45] what doesn't work about them
[07:45] Whoever added the "stop immediately" button to the warrior, you are a saint.
[07:45] ~/.local/bin/run-pipeline does not exist
[07:46] But I have to know, what does it do?
[07:46] (Plain shutdown, save state and sleep, ?)
[07:46] winr4r: it does if you installed seesaw with the pip install --user command
[07:46] winr4r: maybe you need just run-pipeline?
[07:46] ivan`: oh, yeah, that worked
[07:46] hurr
[07:47] i think i installed pip as root
[07:47] sorry
[07:50] why do these instructions have you install pip from the package manager and then download and install it again via source
[07:50] omf_: that is not the case
[07:55] winr4r meant that he installed seesaw as root
[07:56] I just tried your debian 6 instructions on Debian 6.0.2 and they fail with dependency problems. Did you mean the newest version of the Debian 6.x series?
[07:56] yes, most likely, sorry
[07:57] I tested debian-6.0.3-amd64 and debian-6.0.3-i386
[07:57] here is the output on 6.0.2 http://paste.archivingyoursh.it/sutoyaweso.vhdl
[07:58] is this a 32-bit machine on which ./wget-lua-warrior runs fine?
[07:59] I don't know how to resolve that dependency situation
[08:00] ivan`, did you test debian 7.0 or 7.1?
[08:00] debian-7.0.0-amd64
[08:02] ivan`, don't worry about resolving that problem. I added some version info to the readme which should make things clearer https://github.com/ArchiveTeam/greader-directory-grab/blob/master/README.md
[08:03] thanks
[10:02] http://www.archiveteam.org/index.php?title=ScreenshotsDatabase.com I think this site died without anybody noticing
[10:06] ivan`: bluh, the domain was privately registered too, so there's no way of contacting the guy
[10:06] and the only contact email for the guy was admin@screenshotsdatabase.com, which obviously won't work
[10:08] http://www.atari-forum.com/viewtopic.php?f=14&t=16820 try PMing him on this forum?
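For anyone following along, the greader-directory-grab instructions quoted earlier (around 07:44-07:46) fit together roughly like this; the clone URL is inferred from the README link above, and the run-pipeline path depends on whether seesaw was installed with --user or system-wide:

    git clone https://github.com/ArchiveTeam/greader-directory-grab
    cd greader-directory-grab
    # installs seesaw for the current user; run-pipeline lands in ~/.local/bin
    pip install --user seesaw
    screen ~/.local/bin/run-pipeline --disable-web-server pipeline.py YOURNICKNAME
    # if seesaw was installed as root instead, plain 'run-pipeline' is already on PATH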
[10:50] You know, maybe we should have a bot in the channel.
[10:50] Once a day it pings every site on deathwatch, so that doesn't happen.
[10:50] Maybe even more often than that.
[10:51] would that have helped in this case?
[10:51] Was it already on deathwatch?
[10:51] I don't think so
[10:51] That was the impression I got.
[10:52] Oh, nevermind then.
[10:52] more importantly, a bot that yells at people about every unsaved site
[10:52] * namespace shrugs
[10:53] okay, so the domain still exists
[10:53] i'm going to email him and see if it bounces
[10:54] Good luck.
[10:55] sent, let's see what happens
[10:59] WELL IT DIDN'T BOUNCE YET
[11:09] surely such a bot would only tell us when it's too late, anyway? *looks puzzled*
[11:10] YouDunGoofedBot
[11:11] also
[11:11] were @me.com mobileme email addresses?
[11:12] i'm trying to get in contact with another guy who had a big archive of stuff and then his site died
[11:17] yes, I believe they were - now iCloud, of course
[11:23] winr4r: They should still be working AFAIK
[11:36] emailed it anyway, we'll find out in a few minutes
[15:07] *You get ops! And *you* get ops!
[15:11] because i wanted to make my pc do something, i started grabbing the hem.passagen.se sites, 13k after a day
[15:35] when using wget to make a warc, the options --adjust-extension and --convert-links are ignored for the warc, correct?
[18:39] my hem.passagen.se downloader: https://gist.github.com/SpiritQuaddicted/5ff86e28ad786ced0988
[19:50] SketchCow: could mirrors of FTP sites also be uploaded with the files as-is, or must they be compressed into a single archive as usual?
[19:56] SketchCow: word is you have a contact on the Google Data Liberation team. if at all possible it would be great if you could request from them any and all Google Reader data, or at least a list of feed URLs that they have cached data for.
[20:04] uploaded: https://archive.org/details/torrentfreak.com-20130619
[20:09] Nemo_bis: Make it an archive
[20:09] Anything else blows it up
[20:14] oh wow, i forgot -np
[20:33] SketchCow: ok, but how do you make an archive without free disk space, deleting each file as soon as it's added, without resorting to ugly scripts?
[20:33] zip -rm deletes only at the end
[20:36] for x in ./; do zip -command_option_to_add_file; done
[20:39] hmmm
[20:39] subnet mask for a single IP: /31 or /32?
[21:36] Smiley: that doesn't work too well with huge subdirs
[21:36] for x in ./{1..something}
[21:43] find . -exec [do something with '{};'] ?
[21:45] there are at least a few stackoverflow (probably serverfault too) posts about bash and dealing with large numbers of items
[21:45] one recent one was about deleting large numbers of files, and rsync was the fastest
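On the archive-without-free-space question from 20:33, one way to follow the find suggestion above might be the following; the archive name is a placeholder, -m deletes each file as soon as it has been stored, and -g appends to the existing archive instead of rewriting it, so no temporary second copy of the archive is needed:

    # add files one at a time, freeing the space of each file as it is archived
    find . -type f ! -name ftp-mirror.zip \
        -exec zip -g -m ftp-mirror.zip {} \;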