#archiveteam 2013-06-20,Thu


Time Nickname Message
00:01 🔗 namespace Just saying, I've got one in mind that seems to be in that catatonic state sites go into for a long time before they lurch over and die.
00:02 🔗 namespace *double over
00:04 🔗 arrith1 namespace: there are a few kind of template wget usages
00:05 🔗 arrith1 namespace: http://www.archiveteam.org/index.php?title=User:Djsmiley2k
00:05 🔗 arrith1 http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
00:05 🔗 namespace arrith1: I know.
00:06 🔗 arrith1 wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file misc-yero-org http://misc.yero.org/modulez/
00:06 🔗 arrith1 namespace: oh ok
00:06 🔗 namespace I'll definitely look at those when I set up my grab.
00:06 🔗 arrith1 namespace: looks easy enough with recursion. just pointing at some base-ish url and it basically handles itself it looks like
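For reference, the command arrith1 pasted above, reflowed with one comment per flag (nothing here beyond what that line already contains):

    # -e robots=off                ignore robots.txt
    # -r -l 0 -m                   recursive mirror; -l 0 means no depth limit
    # -p                           also fetch page requisites (images, CSS, etc.)
    # --wait 1                     pause one second between requests
    # --warc-header ...            extra header recorded in the WARC
    # --warc-cdx                   write a CDX index alongside the WARC
    # --warc-file misc-yero-org    basename of the .warc.gz output
    wget -e robots=off -r -l 0 -m -p --wait 1 \
        --warc-header "operator: Archive Team" --warc-cdx \
        --warc-file misc-yero-org \
        http://misc.yero.org/modulez/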
01:07 🔗 dashcloud here's some more, including some more suited to multiple grabs: http://pad.archivingyoursh.it/p/wget-warc
01:07 🔗 dashcloud I did use one of the bottom two for the nwnet.co.uk grab
01:08 🔗 dashcloud I didn't come up with any of them- just wrote them down so I would know what I did for next time
02:40 🔗 godane i'm uploading BBC Test Pilot series
03:24 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.1
03:24 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.2
03:24 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.3
03:24 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.4
03:25 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.5
03:25 🔗 godane uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.6
03:25 🔗 godane so now you have that bbc series
03:26 🔗 DFJustin if you add vhs to the keywords it should show up in https://archive.org/details/digitizedfromvhs
03:32 🔗 godane i added vhs to the keywords
03:39 🔗 namespace Question: Should I bother following a crawl delay if it's measured in minutes?
03:42 🔗 arrith1 namespace: if it's not time-sensitive (as in the site isn't going away in hours/days) i'd say yeah
03:55 🔗 namespace "If you do not understand the difference between these notations, or do not know which one to use, just use the plain ordinary format you use with your favorite browser, like Lynx or Netscape." Yeah GNU, Lynx is my go-to browser.
03:55 🔗 namespace (Lets you know when this was written, doesn't it?)
05:52 🔗 SketchCow Back
06:02 🔗 godane hey
06:04 🔗 BlueMax hi
06:04 🔗 godane some one should archive this: http://old-blog.boxee.tv/
06:09 🔗 namespace What's the --warc-file option do?
06:09 🔗 namespace It's not mentioned in the manual.
06:10 🔗 TrojanEel http://www.archiveteam.org/index.php?title=Wget_with_WARC_output describes it
06:11 🔗 namespace ...
06:11 🔗 namespace Damn, I've got version 1.13
06:12 🔗 namespace Should I go out and install a newer version or?
06:13 🔗 Smiley yes
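A quick way to see whether an installed wget is new enough (WARC output arrived in wget 1.14; the grep is just a convenience check):

    wget --version | head -n 1    # e.g. "GNU Wget 1.13.4 ..."
    wget --help | grep -i warc    # no output means no WARC support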
06:13 🔗 BlueMax http://lifehacker.com/log-in-to-your-yahoo-mail-address-or-lose-it-on-july-1-514371670 this is interesting
06:14 🔗 namespace Smiley: Where do I get it? Duckduckgo is not forthcoming.
06:15 🔗 Smiley i think theres a version on github
06:15 🔗 * Smiley checks
06:16 🔗 Smiley ah it's down
06:16 🔗 Smiley I think the archiveteam github repo has a copy namespace
06:16 🔗 namespace --save-headers is not an acceptable substitute for a small site grab?
06:17 🔗 TrojanEel Also, http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
06:17 🔗 namespace I'm not compiling it. -_-
06:17 🔗 Smiley ...
06:17 🔗 Smiley why not?
06:17 🔗 namespace I mean I could, but it'd be tedious.
06:17 🔗 Smiley if you don't have it already your distro obviously sucks.
06:17 🔗 Smiley ./configure && make is tedious ?
06:18 🔗 namespace Smiley: No no, managing binaries without a package manager is tedious.
06:18 🔗 Smiley well your package manager isn't managing to provide it
06:18 🔗 * Smiley has gone away for wifes birthday.
06:18 🔗 DFJustin if it's not warc it can't be easily imported into the wayback machine, which makes the data a lot less accessible
06:18 🔗 namespace Hmm.
06:19 🔗 namespace And I'm afraid that I won't compile it with the SSL libraries or something.
06:19 🔗 namespace There's nothing worse than dependency hell.
06:23 🔗 namespace Okay fine I'll compile it.
06:23 🔗 arrith1 i compiled wget 1.14 recently on debian. just had to do the use openssl configure option
06:23 🔗 godane i'm uploading the millennium celbrations on bbc1
06:23 🔗 arrith1 debian 6 that is
06:24 🔗 godane *celebrations
06:25 🔗 arrith1 sudo apt-get install -y build-essential openssl libssl-dev
06:25 🔗 namespace Thank you, was just about to search what the compiling package was again.
06:26 🔗 Smiley don't *install* from the make package
06:26 🔗 Smiley simply put it all elsewhere and call it directly
06:26 🔗 arrith1 yeah, i just make then run it in place
06:26 🔗 arrith1 symlink into PATH, or just alias, or full path. i like full path myself
06:27 🔗 Smiley yah full path
06:27 🔗 Smiley I have /home/$user/bin/tools/random_stuff_i've_compiled_here
06:27 🔗 namespace I know how to use a binary, I've had to do it more than once, which is of course why I hate doing it.
06:27 🔗 arrith1 namespace: ./configure --with-ssl=openssl
06:27 🔗 namespace Is that the only compile option to worry about?
06:27 🔗 namespace I read the manual and it didn't mention any others.
06:28 🔗 arrith1 namespace: it worked just like that fine for me on debian 6
06:28 🔗 namespace K.
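Putting the pieces of this exchange together, a rough build-and-run-in-place sequence on a Debian-ish system (package names and the src/wget path are what worked here; treat it as a sketch rather than the one true recipe):

    # build dependencies (Debian/Ubuntu package names)
    sudo apt-get install -y build-essential openssl libssl-dev

    # fetch and unpack the 1.14 tarball TrojanEel linked
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xzf wget-1.14.tar.gz
    cd wget-1.14

    # configure with OpenSSL and build; skip 'make install' and run it in place
    ./configure --with-ssl=openssl
    make

    # call the freshly built binary by full path, as suggested above
    ./src/wget --version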
06:35 🔗 namespace Okay, got it.
06:39 🔗 arrith1 good
06:41 🔗 namespace I guess my linux-fu has improved significantly since the last time I tried to do this.
06:41 🔗 namespace Because I remember binaries being nothing but trouble.
06:42 🔗 arrith1 maybe depends on the binary. one thing is to not do 'make install' because removing it can be tricky
06:59 🔗 namespace And I've started, wish me luck.
07:19 🔗 arrith1 gl
07:20 🔗 namespace My biggest fear is actually being too aggressive with the grab.
07:21 🔗 namespace I bandwidth-limited to dial-up speed and did -w 10, and I *think* I'm respecting robots.txt, which I guess is enough.
07:22 🔗 arrith1 namespace: yeah if it's not time sensitive you can scale back. also to be nice use an honest useragent with "ArchiveTeam" in it somewhere
07:22 🔗 namespace I guess it's not time sensitive, but I'm not sure that wget will resume the download correctly, and how far would you scale back to?
07:23 🔗 namespace (And I used the user agent from Smiley's user profile.)
07:23 🔗 arrith1 hm i think your current options are okay
07:24 🔗 arrith1 i mean scale back from the defaults of like no wait
07:24 🔗 namespace Yeah, Smiley's script is much more aggressive than mine.
07:24 🔗 namespace Quarter of a second wait time + explicitly ignores robots.txt
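A gentler single-site grab along the lines namespace describes (long waits, a bandwidth cap, an honest user agent, robots.txt respected by default); the URL, rate, and contact address are placeholders:

    wget --mirror --page-requisites \
        --wait 10 --limit-rate=50k \
        --user-agent "ArchiveTeam grab (contact: you@example.org)" \
        --warc-file example-site --warc-cdx \
        http://example.org/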
07:29 🔗 omf_ after watching people fumble through building wget yet again I have added basic instructions to the wiki http://archiveteam.org/index.php?title=Wget
07:30 🔗 namespace omf_: "fumble" probably isn't the right word. It's not like it took me a very long time to do it.
07:31 🔗 namespace The only reason it took as long as it did is that I haven't compiled anything on this install yet.
07:31 🔗 namespace (And thus had to grab build-essential/libssl/etc)
07:31 🔗 omf_ dude you complained the whole time <namespace> I'm not compiling it. -_-
07:32 🔗 namespace omf_: Yeah, that's because I hate compiling binaries for miscellaneous utilities.
07:32 🔗 omf_ talk about a first world problem
07:32 🔗 namespace It's not the compiling that sucks, it's the management.
07:32 🔗 arrith1 omf_: ./configure --with-ssl=openssl
07:32 🔗 arrith1 omf_: on debian 6 i needed: sudo apt-get install -y build-essential openssl libssl-dev
07:32 🔗 namespace Do it more than a few times and finding your shite gets annoying.
07:33 🔗 namespace omf_: And since you're probably not getting that tone out of my voice right now: Thank you.
07:34 🔗 omf_ arrith1, The instructions are distro independent. No assumptions are made about which distribution you use. Not everyone uses Debian
07:36 🔗 namespace omf_: One of the great annoyances of giving instructions is that distro-independent instructions are useless: if you know what to change then you don't need the instructions; if you don't know what to change then the instructions will only frustrate you.
07:36 🔗 namespace Then again, the SSL support isn't strictly necessary.
07:36 🔗 omf_ if you are grabbing an https site it is
07:37 🔗 omf_ which happens a lot
07:37 🔗 ivan` both greader-grab and greader-directory-grab could use a few more downloaders, in case anyone is up for it
07:37 🔗 ivan` 11 days left
07:37 🔗 winr4r ivan`: is directory-grab particularly bandwidth-intensive?
07:38 🔗 namespace omf_: Keyword there is strictly. It's probably better to just tell people to grab libssl-dev and to add the configure flag, which is distro independent.
07:39 🔗 ivan` winr4r: no, very low bandwidth
07:39 🔗 namespace (Well maybe not the libssl-dev, the package could be named something else.)
07:39 🔗 winr4r ivan`: fuck yeah i'm on it
07:39 🔗 ivan` thanks
07:39 🔗 omf_ first off you do not need the flag if the dependency is properly installed. It is an autotools config
07:39 🔗 namespace omf_: Ah.
07:40 🔗 omf_ which is signified by the fact you run ./configure to setup the build
07:43 🔗 winr4r also
07:44 🔗 winr4r someone with commit access should probably fix the README.md for greader-directory-grab because the instructions don't work
07:45 🔗 winr4r # Start downloading with:
07:45 🔗 winr4r screen ~/.local/bin/run-pipeline --disable-web-server pipeline.py YOURNICKNAME
07:45 🔗 ivan` what doesn't work about them
07:45 🔗 namespace Whoever added the "stop immediately" button to the warrior, you are a saint.
07:45 🔗 winr4r ~/.local/bin/run-pipeline does not exist
07:46 🔗 namespace But I have to know, what does it do?
07:46 🔗 namespace (Plain shutdown, save state and sleep, ?)
07:46 🔗 ivan` winr4r: it does if you installed seesaw with the pip install --user command
07:46 🔗 ivan` winr4r: maybe you need just run-pipeline?
07:46 🔗 winr4r ivan`: oh, yeah, that worked
07:46 🔗 winr4r hurr
07:47 🔗 winr4r i think i installed pip as root
07:47 🔗 winr4r sorry
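Roughly what the two cases above amount to (the repo URL is the one linked later in this log; exact steps may differ, so defer to the project README):

    # per-user install: run-pipeline lands in ~/.local/bin
    pip install --user seesaw
    git clone https://github.com/ArchiveTeam/greader-directory-grab
    cd greader-directory-grab
    screen ~/.local/bin/run-pipeline --disable-web-server pipeline.py YOURNICKNAME

    # if seesaw was instead installed system-wide (e.g. as root), the script is already on PATH:
    # screen run-pipeline --disable-web-server pipeline.py YOURNICKNAME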
07:50 🔗 omf_ why do these instructions have you install pip from the package manager and then download and install it again from source
07:50 🔗 ivan` omf_: that is not the case
07:55 🔗 ivan` winr4r meant that he installed seesaw as root
07:56 🔗 omf_ I just tried your debian 6 instructions on Debian 6.0.2 and they fail with dependency problems. Did you mean the newest version of Debian 6.x series?
07:56 🔗 ivan` yes, most likely, sorry
07:57 🔗 ivan` I tested debian-6.0.3-amd64 and debian-6.0.3-i386
07:57 🔗 omf_ here is the output on 6.0.2 http://paste.archivingyoursh.it/sutoyaweso.vhdl
07:58 🔗 ivan` is this a 32-bit machine on which ./wget-lua-warrior runs fine?
07:59 🔗 ivan` I don't know how to resolve that dependency situation
08:00 🔗 omf_ ivan`, did you test debian 7.0 or 7.1?
08:00 🔗 ivan` debian-7.0.0-amd64
08:02 🔗 omf_ ivan`, don't worry about resolving that problem. I added some version info to the readme which should make things clearer https://github.com/ArchiveTeam/greader-directory-grab/blob/master/README.md
08:03 🔗 ivan` thanks
10:02 🔗 ivan` http://www.archiveteam.org/index.php?title=ScreenshotsDatabase.com I think this site died without anybody noticing
10:06 🔗 winr4r ivan`: bluh, the domain was privately registered too so no way of contacting the guy
10:06 🔗 winr4r and the only contact email for the guy was admin@screenshotsdatabase.com which obviously won't work
10:08 🔗 ivan` http://www.atari-forum.com/viewtopic.php?f=14&t=16820 try PMing him on this forum?
10:50 🔗 namespace You know, maybe we should have a bot in the channel.
10:50 🔗 namespace Once a day it pings every site on deathwatch, so that doesn't happen.
10:50 🔗 namespace Maybe even more often than that.
10:51 🔗 ivan` would that have helped in this case?
10:51 🔗 namespace Was it already on deathwatch?
10:51 🔗 ivan` I don't think so
10:51 🔗 namespace That was the impression I got.
10:52 🔗 namespace Oh nevermind then.
10:52 🔗 ivan` more importantly, a bot that yells at people about every unsaved site
10:52 🔗 * namespace shrugs
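A minimal sketch of the check namespace is proposing; no such bot existed, and the URL list file and any IRC notification wiring are entirely hypothetical:

    # deathwatch-ping.sh: flag sites from a URL list that no longer answer
    while read -r url; do
        # --fail makes HTTP errors (>= 400) count as failures too
        if ! curl --silent --fail --head --max-time 30 --output /dev/null "$url"; then
            echo "DOWN: $url"
        fi
    done < deathwatch-urls.txt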
10:53 🔗 winr4r okay, so the domain still exists
10:53 🔗 winr4r i'm going to email him and see if it bounces
10:54 🔗 namespace Good luck.
10:55 🔗 winr4r sent, let's see what happens
10:59 🔗 winr4r WELL IT DIDN'T BOUNCE YET
11:09 🔗 Baljem surely such a bot would only tell us when it's too late, anyway? *looks puzzled*
11:10 🔗 winr4r YouDunGoofedBot
11:11 🔗 winr4r also
11:11 🔗 winr4r were @me.com mobileme email addresses?
11:12 🔗 winr4r i'm trying to get in contact with another guy who had a big archive of stuff and then his site died
11:17 🔗 Baljem yes, I believe they were - now iCloud, of course
11:23 🔗 ersi winr4r: They should still be working AFAIK
11:36 🔗 winr4r emailed it anyway, we'll find out in a few minutes
15:07 🔗 mistym *You* get ops! And *you* get ops!
15:11 🔗 Schbirid because i wanted to make my pc do something i started grabbing the hem.passagen.se sites, 13k after a day
15:35 🔗 Schbirid when using wget to make a warc, the options --adjust-extension and --convert-links are ignored for the warc, correct?
18:39 🔗 Schbirid my hem.passagen.se downloader: https://gist.github.com/SpiritQuaddicted/5ff86e28ad786ced0988
19:50 🔗 Nemo_bis SketchCow: could mirrors of FTP sites also be uploaded with the files as-is, or must they be compressed into a single archive as usual?
19:56 🔗 arrith1 SketchCow: word is you have a contact on the Google Data Liberation team, if at all possible it would be great if you could request from them any and all Google Reader data, but at least a list of feed URLs that they have cached data for.
20:04 🔗 godane uploaded: https://archive.org/details/torrentfreak.com-20130619
20:09 🔗 SketchCow Nemo_bis: Make it an archive
20:09 🔗 SketchCow Anything else blows it up
20:14 🔗 Schbirid oh wow, i forgot -np
20:33 🔗 Nemo_bis SketchCow: ok, but how do I make an archive without free disk space, deleting each file as soon as it's added, without resorting to ugly scripts?
20:33 🔗 Nemo_bis zip -rm deletes only at the end
20:36 🔗 Smiley for x in ./*; do zip -m archive.zip "$x"; done
20:39 🔗 Smiley hmmm
20:39 🔗 Smiley subnet mask for single IP, /31 or /32 ?
21:36 🔗 Nemo_bis Smiley: that doesn't work too well with huge subdirs
21:36 🔗 Smiley for x in ./{1..something}
21:43 🔗 Baljem find . -exec [do something with] '{}' \; ?
21:45 🔗 arrith1 there are at least a few stackoverflow (probably serverfault too) posts about bash and dealing with large numbers of items
21:45 🔗 arrith1 one recent one was about deleting large numbers of files, and rsync was the fastest
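One way to do what Nemo_bis asked with stock tools, combining Baljem's find suggestion with zip's -m (move) option, which deletes each file once it has been added; the archive path is a placeholder and should sit outside the tree being archived so find doesn't pick it up:

    # adds files in batches, deleting each batch as it is archived;
    # '-exec ... +' passes many files per zip call, so it copes with huge subdirs
    find . -type f -exec zip -m /elsewhere/archive.zip '{}' +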
