#archiveteam 2013-06-20,Thu


Time	Nickname	Message
00:01 ^🔗	namespace	Just saying, I've got one in mind that seems to be in that catatonic state sites go into for a long time before they lurch over and die.
00:02 ^🔗	namespace	*double over
00:04 ^🔗	arrith1	namespace: there are a few kind of template wget usages
00:05 ^🔗	arrith1	namespace: http://www.archiveteam.org/index.php?title=User:Djsmiley2k
00:05 ^🔗	arrith1	http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
00:05 ^🔗	namespace	arrith1: I know.
00:06 ^🔗	arrith1	wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file misc-yero-org http://misc.yero.org/modulez/
00:06 ^🔗	arrith1	namespace: oh ok
00:06 ^🔗	namespace	I'll definitely look at those when I set up my grab.
00:06 ^🔗	arrith1	namespace: looks easy enough with recursion. just pointing at some base-ish url and it basically handles itself it looks like
01:07 ^🔗	dashcloud	here's some more, including some more suited to multiple grabs: http://pad.archivingyoursh.it/p/wget-warc
01:07 ^🔗	dashcloud	I did use one of the bottom two for the nwnet.co.uk grab
01:08 ^🔗	dashcloud	I didn't come up with any of them- just wrote them down so I would know what I did for next time
02:40 ^🔗	godane	i'm uploading BBC Test Pilot series
03:24 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.1
03:24 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.2
03:24 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.3
03:24 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.4
03:25 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.5
03:25 ^🔗	godane	uploaded: https://archive.org/details/BBC.Test.Pilot.vhsrip.divx.part.6
03:25 ^🔗	godane	so now you have that bbc series
03:26 ^🔗	DFJustin	if you add vhs to the keywords it should show up in https://archive.org/details/digitizedfromvhs
03:32 ^🔗	godane	i add vhs to keywords
03:39 ^🔗	namespace	Question: Should I bother following a crawl delay if it's measured in minutes?
03:42 ^🔗	arrith1	namespace: if it's not time-sensitive (as in the site isn't going away in hours/days) i'd say yeah
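The crawl-delay question above can be checked mechanically. The following is a minimal sketch, assuming the site publishes a `Crawl-delay` line in its robots.txt; the robots.txt content, the `crawl_delay_seconds` helper, and the fallback of 1 second are all hypothetical and only illustrate reading the value before choosing a `--wait` for wget.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real grab would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
Disallow: /private/
"""


def crawl_delay_seconds(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay that applies to user_agent, or default if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


delay = crawl_delay_seconds(ROBOTS_TXT, "ArchiveTeam")
# A delay of 120 seconds would translate to "wget --wait 120".
print(f"wget --wait {delay:g}")
```

Whether to honor a delay measured in minutes is still the judgment call discussed above; the sketch only makes the requested value visible.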
03:55 ^🔗	namespace	"If you do not understand the difference between these notations, or do not know which one to use, just use the plain ordinary format you use with your favorite browser, like Lynx or Netscape." Yeah GNU, Lynx is my go-to browser.
03:55 ^🔗	namespace	(Lets you know when this was written, doesn't it?)
05:52 ^🔗	SketchCow	Back
06:02 ^🔗	godane	hey
06:04 ^🔗	BlueMax	hi
06:04 ^🔗	godane	someone should archive this: http://old-blog.boxee.tv/
06:09 ^🔗	namespace	What's the --warc-file option do?
06:09 ^🔗	namespace	It's not mentioned in the manual.
06:10 ^🔗	TrojanEel	http://www.archiveteam.org/index.php?title=Wget_with_WARC_output describes it
06:11 ^🔗	namespace	...
06:11 ^🔗	namespace	Damn, I've got version 1.13
06:12 ^🔗	namespace	Should I go out and install a newer version or?
06:13 ^🔗	Smiley	yes
06:13 ^🔗	BlueMax	http://lifehacker.com/log-in-to-your-yahoo-mail-address-or-lose-it-on-july-1-514371670 this is interesting
06:14 ^🔗	namespace	Smiley: Where do I get it? Duckduckgo is not forthcoming.
06:15 ^🔗	Smiley	i think theres a version on github
06:15 ^🔗	*	Smiley checks
06:16 ^🔗	Smiley	ah it's down
06:16 ^🔗	Smiley	I think the archiveteam github repo has a copy namespace
06:16 ^🔗	namespace	--save-headers is not an acceptable substitute for a small site grab?
06:17 ^🔗	TrojanEel	Also, http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
06:17 ^🔗	namespace	I'm not compiling it. -_-
06:17 ^🔗	Smiley	...
06:17 ^🔗	Smiley	why not?
06:17 ^🔗	namespace	I mean I could, but it'd be tedious.
06:17 ^🔗	Smiley	if you don't have it already your distro obviously sucks.
06:17 ^🔗	Smiley	./configure && make is tedious ?
06:18 ^🔗	namespace	Smiley: No no, managing binaries without a package manager is tedious.
06:18 ^🔗	Smiley	well your package manager isn't managing to provide it
06:18 ^🔗	*	Smiley has gone away for wifes birthday.
06:18 ^🔗	DFJustin	if it's not warc it can't be easily imported into the wayback machine, which makes the data a lot less accessible
06:18 ^🔗	namespace	Hmm.
06:19 ^🔗	namespace	And I'm afraid that I won't compile it with the SSL libraries or something.
06:19 ^🔗	namespace	There's nothing worse than dependency hell.
06:23 ^🔗	namespace	Okay fine I'll compile it.
06:23 ^🔗	arrith1	i compiled wget 1.14 recently on debian. just had to do the use openssl configure option
06:23 ^🔗	godane	i'm uploading the millennium celbrations on bbc1
06:23 ^🔗	arrith1	debian 6 that is
06:24 ^🔗	godane	*celebrations
06:25 ^🔗	arrith1	sudo apt-get install -y build-essential openssl libssl-dev
06:25 ^🔗	namespace	Thank you, was just about to search what the compiling package was again.
06:26 ^🔗	Smiley	don't install from the make package
06:26 ^🔗	Smiley	simply put it all elsewhere and call it directly
06:26 ^🔗	arrith1	yeah, i just make then run it in place
06:26 ^🔗	arrith1	symlink into PATH, or just alias, or full path. i like full path myself
06:27 ^🔗	Smiley	yah full path
06:27 ^🔗	Smiley	I have /home/$user/bin/tools/random_stuff_i've_compiled_here
06:27 ^🔗	namespace	I know how to use a binary, I've had to do it more than once, which is of course why I hate doing it..
06:27 ^🔗	arrith1	namespace: ./configure --with-ssl=openssl
06:27 ^🔗	namespace	Is that the only compile option to worry about?
06:27 ^🔗	namespace	I read the manual and it didn't mention any others.
06:28 ^🔗	arrith1	namespace: it worked just like that fine for me on debian 6
06:28 ^🔗	namespace	K.
06:35 ^🔗	namespace	Okay, got it.
06:39 ^🔗	arrith1	good
06:41 ^🔗	namespace	I guess my linux-fu has improved significantly since the last time I tried to do this.
06:41 ^🔗	namespace	Because I remember binaries being nothing but trouble.
06:42 ^🔗	arrith1	maybe depends on the binary. one thing is to not do 'make install' because removing it can be tricky
06:59 ^🔗	namespace	And I've started, wish me luck.
07:19 ^🔗	arrith1	gl
07:20 ^🔗	namespace	My biggest fear is actually being too aggressive with the grab.
07:21 ^🔗	namespace	I bandwidth limited to dial-up and did -w 10, and I think I'm using robots.txt, which I guess is enough.
07:22 ^🔗	arrith1	namespace: yeah if it's not time sensitive you can scale back. also to be nice use an honest useragent with "ArchiveTeam" in it somewhere
07:22 ^🔗	namespace	I guess it's not time sensitive, but I'm not sure that wget will resume the download correctly, and how far would you scale back to?
07:23 ^🔗	namespace	(And I used the user agent on Smiley's user profile.)
07:23 ^🔗	arrith1	hm i think your current options are okay
07:24 ^🔗	arrith1	i mean scale back from the defaults of like no wait
07:24 ^🔗	namespace	Yeah, smileys script is much more aggressive than mine.
07:24 ^🔗	namespace	Quarter of a second wait time + explicitly ignores robots.txt
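The "polite grab" settings discussed above (a wait between requests, a bandwidth limit, an honest user agent, WARC output) can be collected in one place. This is a rough sketch, not anyone's actual script: the `polite_wget_command` helper, the `7k` rate (an assumed stand-in for "dial-up"), the user-agent string, and the example URL are all hypothetical. The wget options themselves are the ones quoted earlier in the log plus `--limit-rate` and `--user-agent`, and the WARC options need wget 1.14 or newer.

```python
import shlex


def polite_wget_command(url, warc_name, wait_seconds=10, rate_limit="7k",
                        user_agent="ArchiveTeam (hypothetical contact details)",
                        obey_robots=True):
    """Build a wget argument list for a slow, identifiable, WARC-producing grab."""
    args = [
        "wget", "--mirror", "--page-requisites",
        "--wait", str(wait_seconds),        # pause between requests
        "--limit-rate", rate_limit,         # assumed dial-up-like cap
        "--user-agent", user_agent,         # honest agent naming ArchiveTeam
        "--warc-file", warc_name,
        "--warc-header", "operator: Archive Team",
        "--warc-cdx",
    ]
    if not obey_robots:
        # Only for the more aggressive style of grab mentioned above.
        args += ["-e", "robots=off"]
    args.append(url)
    return args


cmd = polite_wget_command("http://example.com/", "example-com")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the grab; printing it first makes it easy to review the options before hitting a fragile site.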
07:29 ^🔗	omf_	after watching people fumble through building wget yet again I have added basic instructions to the wiki http://archiveteam.org/index.php?title=Wget
07:30 ^🔗	namespace	omf_: "fumble" probably isn't the right word. It's not like it took me a very long time to do it.
07:31 ^🔗	namespace	The only reason it took as long as it did is that I haven't compiled anything on this install yet.
07:31 ^🔗	namespace	(And thus had to grab build-essential/libssl/etc)
07:31 ^🔗	omf_	dude you complained the whole time <namespace> I'm not compiling it. -_-
07:32 ^🔗	namespace	omf_: Yeah, that's because I hate compiling binaries for miscellaneous utilities.
07:32 ^🔗	omf_	talk about a first world problem
07:32 ^🔗	namespace	It's not the compiling that sucks, it's the management.
07:32 ^🔗	arrith1	omf_: ./configure --with-ssl=openssl
07:32 ^🔗	arrith1	omf_: on debian 6 i needed: sudo apt-get install -y build-essential openssl libssl-dev
07:32 ^🔗	namespace	Do it more than a few times and finding your shite gets annoying.
07:33 ^🔗	namespace	omf_: And since you're probably not getting that tone out of my voice right now: Thank you.
07:34 ^🔗	omf_	arrith1, The instructions are distro independent. No assumptions are made about which distribution you use. Not everyone uses Debian
07:36 ^🔗	namespace	omf_: One of the great annoyances of giving instructions is that distro-independent instructions are useless, if you know what to change then you don't need the instructions, if you don't know what to change then the instructions will only frustrate you.
07:36 ^🔗	namespace	Then again, the SSL support isn't strictly necessary.
07:36 ^🔗	omf_	if you are grabbing an https site it is
07:37 ^🔗	omf_	which happens a lot
07:37 ^🔗	ivan`	both greader-grab and greader-directory-grab could use a few more downloaders, in case anyone is up for it
07:37 ^🔗	ivan`	11 days left
07:37 ^🔗	winr4r	ivan`: is directory-grab particularly bandwidth-intensive?
07:38 ^🔗	namespace	omf_: Keyword there is strictly. It's probably better to just tell people to grab libssl-dev and to add the configure flag, which is distro independent.
07:39 ^🔗	ivan`	winr4r: no, very low bandwidth
07:39 ^🔗	namespace	(Well maybe not the libssl-dev, the package could be named something else.)
07:39 ^🔗	winr4r	ivan`: fuck yeah i'm on it
07:39 ^🔗	ivan`	thanks
07:39 ^🔗	omf_	first off you do not need the flag if the dependency is properly installed. It is an autotools config
07:39 ^🔗	namespace	omf_: Ah.
07:40 ^🔗	omf_	which is signified by the fact you run ./configure to setup the build
07:43 ^🔗	winr4r	also
07:44 ^🔗	winr4r	someone with commit access should probably fix the README.md for greader-directory-grab because the instructions don't work
07:45 ^🔗	winr4r	# Start downloading with:
07:45 ^🔗	winr4r	screen ~/.local/bin/run-pipeline --disable-web-server pipeline.py YOURNICKNAME
07:45 ^🔗	ivan`	what doesn't work about them
07:45 ^🔗	namespace	Whoever added the "stop immediately" button to the warrior, you are a saint.
07:45 ^🔗	winr4r	~/.local/bin/run-pipeline does not exist
07:46 ^🔗	namespace	But I have to know, what does it do?
07:46 ^🔗	namespace	(Plain shutdown, save state and sleep, ?)
07:46 ^🔗	ivan`	winr4r: it does if you installed seesaw with the pip install --user command
07:46 ^🔗	ivan`	winr4r: maybe you need just run-pipeline?
07:46 ^🔗	winr4r	ivan`: oh, yeah, that worked
07:46 ^🔗	winr4r	hurr
07:47 ^🔗	winr4r	i think i installed pip as root
07:47 ^🔗	winr4r	sorry
07:50 ^🔗	omf_	why do these instructions have you install pip from the package manager and then download and install it again via source
07:50 ^🔗	ivan`	omf_: that is not the case
07:55 ^🔗	ivan`	winr4r meant that he installed seesaw as root
07:56 ^🔗	omf_	I just tried your debian 6 instructions on Debian 6.0.2 and they fail with dependency problems. Did you mean the newest version of Debian 6.x series?
07:56 ^🔗	ivan`	yes, most likely, sorry
07:57 ^🔗	ivan`	I tested debian-6.0.3-amd64 and debian-6.0.3-i386
07:57 ^🔗	omf_	here is the output on 6.0.2 http://paste.archivingyoursh.it/sutoyaweso.vhdl
07:58 ^🔗	ivan`	is this a 32-bit machine on which ./wget-lua-warrior runs fine?
07:59 ^🔗	ivan`	I don't know how to resolve that dependency situation
08:00 ^🔗	omf_	ivan`, did you test debian 7.0 or 7.1?
08:00 ^🔗	ivan`	debian-7.0.0-amd64
08:02 ^🔗	omf_	ivan`, don't worry about resolving that problem. I added some version info to the readme which should make things clearer https://github.com/ArchiveTeam/greader-directory-grab/blob/master/README.md
08:03 ^🔗	ivan`	thanks
10:02 ^🔗	ivan`	http://www.archiveteam.org/index.php?title=ScreenshotsDatabase.com I think this site died without anybody noticing
10:06 ^🔗	winr4r	ivan`: bluh, the domain was privately registered too so no way of contacting the guy
10:06 ^🔗	winr4r	and the only contact email for the guy was admin@screenshotsdatabase.com which obviously won't work
10:08 ^🔗	ivan`	http://www.atari-forum.com/viewtopic.php?f=14&t=16820 try PMing him on this forum?
10:50 ^🔗	namespace	You know, maybe we should have a bot in the channel.
10:50 ^🔗	namespace	Once a day it pings every site on deathwatch, so that doesn't happen.
10:50 ^🔗	namespace	Maybe even more often than that.
10:51 ^🔗	ivan`	would that have helped in this case?
10:51 ^🔗	namespace	Was it already on deathwatch?
10:51 ^🔗	ivan`	I don't think so
10:51 ^🔗	namespace	That was the impression I got.
10:52 ^🔗	namespace	Oh nevermind then.
10:52 ^🔗	ivan`	more importantly, a bot that yells at people about every unsaved site
10:52 ^🔗	*	namespace shrugs
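The bot idea floated above (check every watched site on a schedule and complain about the ones that stop answering) could look roughly like this. It is a minimal sketch under stated assumptions: `check_site` and `report_dead` are hypothetical names, the URL list would come from wherever the deathwatch list lives, and posting to IRC and the daily scheduling are left out entirely.

```python
import urllib.error
import urllib.request


def check_site(url, opener=urllib.request.urlopen, timeout=30):
    """Return (url, ok, detail) for a single HTTP request to url."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            return url, 200 <= status < 400, f"HTTP {status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return url, False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return url, False, f"unreachable: {exc}"


def report_dead(urls, opener=urllib.request.urlopen):
    """Return one human-readable line per site that failed its check."""
    results = (check_site(url, opener) for url in urls)
    return [f"{url} looks down ({detail})" for url, ok, detail in results if not ok]


# Example use (performs real network requests, so it is left commented out):
# for line in report_dead(["http://example.com/"]):
#     print(line)
```

As noted in the conversation, a check like this only reports a site after it has already stopped responding, so it complements a deathwatch list rather than replacing it.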
10:53 ^🔗	winr4r	okay, so the domain still exists
10:53 ^🔗	winr4r	i'm going to email him and see if it bounces
10:54 ^🔗	namespace	Good luck.
10:55 ^🔗	winr4r	sent, let's see what happens
10:59 ^🔗	winr4r	WELL IT DIDN'T BOUNCE YET
11:09 ^🔗	Baljem	surely such a bot would only tell us when it's too late, anyway? looks puzzled
11:10 ^🔗	winr4r	YouDunGoofedBot
11:11 ^🔗	winr4r	also
11:11 ^🔗	winr4r	were @me.com mobileme email addresses?
11:12 ^🔗	winr4r	i'm trying to get in contact with another guy who had a big archive of stuff and then his site died
11:17 ^🔗	Baljem	yes, I believe they were - now iCloud, of course
11:23 ^🔗	ersi	winr4r: They should still be working AFAIK
11:36 ^🔗	winr4r	emailed it anyway, we'll find out in a few minutes
15:07 ^🔗	mistym	You get ops! And you* get ops!
15:11 ^🔗	Schbirid	because i wanted to make my pc do something i started grabbing the hem.passagen.se sites, 13k after a day
15:35 ^🔗	Schbirid	when using wget to make a warc, the options --adjust-extension and --convert-links are ignored for the warc, correct?
18:39 ^🔗	Schbirid	my hem.passagen.se downloader: https://gist.github.com/SpiritQuaddicted/5ff86e28ad786ced0988
19:50 ^🔗	Nemo_bis	SketchCow: could mirrors of FTP sites also be uploaded with the files as is, or must they be compressed into a single archive as usual?
19:56 ^🔗	arrith1	SketchCow: word is you have a contact on the Google Data Liberation team, if at all possible it would be great if you could request from them any and all Google Reader data, but at least a list of feed URLs that they have cached data for.
20:04 ^🔗	godane	uploaded: https://archive.org/details/torrentfreak.com-20130619
20:09 ^🔗	SketchCow	Nemo_bis: Make it an archive
20:09 ^🔗	SketchCow	Anything else blows it up
20:14 ^🔗	Schbirid	oh wow, i forgot -np
20:33 ^🔗	Nemo_bis	SketchCow: ok, but how to make an archive without free disk space, deleting each file as soon as added, without resorting to ugly scripts?
20:33 ^🔗	Nemo_bis	zip -rm deletes only at the end
20:36 ^🔗	Smiley	for x in ./; do zip -command_option_to_add_file; done
20:39 ^🔗	Smiley	hmmm
20:39 ^🔗	Smiley	subnet mask for single IP, /31 or /32 ?
21:36 ^🔗	Nemo_bis	Smiley: that doesn't work too well with huge subdirs
21:36 ^🔗	Smiley	for x in ./{1..something}
21:43 ^🔗	Baljem	find . -exec [do something with '{};'] ?
21:45 ^🔗	arrith1	there are at least a few stackoverflow (probably serverfault too) posts about bash and dealing with large numbers of items
21:45 ^🔗	arrith1	one recent one was about deleting large numbers of files, and rsync was the fastest
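For the "archive without free disk space" problem above, one possible approach is to append files to the archive one at a time and delete each original as soon as it has been written. The sketch below uses Python's `zipfile` module in append mode; `archive_and_delete` and the example paths are hypothetical, and this is not something anyone in the channel proposed verbatim.

```python
import os
import zipfile


def archive_and_delete(root, archive_path):
    """Append every file under root to a zip archive, deleting each one once stored."""
    archive_abs = os.path.abspath(archive_path)
    with zipfile.ZipFile(archive_path, "a", zipfile.ZIP_DEFLATED, allowZip64=True) as archive:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) == archive_abs:
                    continue  # never add the archive to itself
                # Store the file under its path relative to root, then free its space.
                archive.write(path, os.path.relpath(path, root))
                os.remove(path)


# Example use (destructive: it removes the originals):
# archive_and_delete("ftp-mirror", "ftp-mirror.zip")
```

One caveat: the zip central directory is only written when the archive is closed, so an interruption part-way through can leave an archive that needs repair while the originals are already gone. Trying it on a copy of a small directory first is a sensible precaution.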