#archiveteam 2012-10-10,Wed

Time	Nickname	Message
00:09 ^🔗	SketchCow	WE'LL FIND OUT
00:24 ^🔗	godane	SketchCow: I'm grabbing all of the official xbox magazine podcast
00:24 ^🔗	godane	there is like 311 podcast
00:25 ^🔗	godane	*podcasts
00:25 ^🔗	godane	i'm uploading the rest of no bs podcast now too
02:49 ^🔗	dashcloud	so, I've got all the laptop service manuals from dell's ftp- someone have a place I can upload them to?
11:37 ^🔗	joepie91	alard: is btinternet a warrior project yet?
11:38 ^🔗	alard	Yes, it's more or less ready (barring any new insights) but it's not actually on the warrior.
11:39 ^🔗	alard	https://github.com/ArchiveTeam/btinternet-grab
11:39 ^🔗	alard	Is ready to go.
11:39 ^🔗	alard	(Almost.)
11:40 ^🔗	alard	Why?
11:42 ^🔗	joepie91	well, when it's done, my warrior has something important to do :P
11:43 ^🔗	alard	We should keep looking for more usernames, though.
11:43 ^🔗	alard	I added the sites from DMOZ, from the wayback machine and am waiting for the btinternet links on tvtropes.org.
11:45 ^🔗	joepie91	alright
11:49 ^🔗	alard	I'm now downloading the wikipedia dump as well.
11:50 ^🔗	joepie91	wikipedia dump? as in, find btinternet links on wikipedia?
11:50 ^🔗	joepie91	speaking of which.. I'll have a look in the stackexchange dump
11:50 ^🔗	joepie91	I have it here locally
11:53 ^🔗	alard	joepie91: Yes, bunzip2 | grep ...
11:54 ^🔗	alard	It seems that there are a few links on Wikipedia: https://encrypted.google.com/search?hl=en&q=site%3Awikipedia.org%20btinternet.co.uk
11:55 ^🔗	joepie91	oh goddamnit, I removed the stackexchange data dump a few days ago
11:55 ^🔗	joepie91	redownload time
11:59 ^🔗	SmileyG	alard: I think "all Projects" tab in warrior should be "Choose Project" ?
12:00 ^🔗	alard	SmileyG: Perhaps. But "Choose" is a verb. "Settings" is not. Is "Available projects" a solution?
12:01 ^🔗	SmileyG	Yeah that works
12:01 ^🔗	SmileyG	Currently I'd think "All Projects" would select all projects...... make sense?
12:03 ^🔗	alard	Yes, I think I understand your point. (Although you could also say that it's a tab, not a button, so it shows you "all projects", like it does.)
12:03 ^🔗	SmileyG	Hehe
12:03 ^🔗	joepie91	UI design is hard :P
12:04 ^🔗	SmileyG	Well I have a habit of reading things differently to others, but I was good at it at uni. :S
12:04 ^🔗	alard	It's fun.
12:15 ^🔗	alard	http://tracker.archiveteam.org/btinternet/
12:16 ^🔗	alard	(Don't go too fast.)
12:16 ^🔗	BlueMaxim	thanks for reminding me to see how webshots was doing :P
12:16 ^🔗	BlueMaxim	underscor with 2364GB.
12:16 ^🔗	BlueMaxim	I'm going to kill him one day
12:19 ^🔗	balrog_	why does it say only 8 items done so far? :P
12:19 ^🔗	balrog_	oh I see...
12:19 ^🔗	balrog_	nvm :P
12:21 ^🔗	alard	balrog_: You could be number 1 with 9!
12:22 ^🔗	balrog_	alard: do I have to use warrior?
12:22 ^🔗	balrog_	:|
12:23 ^🔗	alard	What's wrong with the warrior? It's a small project.
12:23 ^🔗	balrog_	takes up more ram and cpu on my side :/
12:23 ^🔗	BlueMaxim	It's pretty minimal how much it takes up.
12:24 ^🔗	joepie91	BlueMaxim: not exactly
12:24 ^🔗	joepie91	it uses up to 20% of my 4GB of RAM
12:24 ^🔗	SmileyG	how long til bt ties?
12:24 ^🔗	joepie91	usually around 13
12:24 ^🔗	BlueMax	joepie91, seriously? I thought it only needed 256MB of RAM
12:24 ^🔗	joepie91	BlueMax: that's the VM itself - apparently virtualbox adds a bunch of overhead on top of that
12:24 ^🔗	joepie91	also
12:24 ^🔗	joepie91	or something
12:24 ^🔗	joepie91	it's quite heavy on CPU
12:24 ^🔗	joepie91	on my shitty notebook i3
12:25 ^🔗	joepie91	2 x 1,3ghz
12:25 ^🔗	Cameron_D	hm, these are synging like 6kb
12:25 ^🔗	Cameron_D	*syncing
12:25 ^🔗	Cameron_D	I guess if there is only one page
12:26 ^🔗	Cameron_D	Oh, 404 error, even smaller
12:26 ^🔗	BlueMax	guess I didn't notice
12:27 ^🔗	BlueMax	My computer must be better at this than I thought :P
12:28 ^🔗	SmileyG	can the tracker be more verbose than 0MB?
12:30 ^🔗	Deewiant	I ran into virtualbox using 7 gigabytes of ram before it got OOM-killed
12:30 ^🔗	Deewiant	While running the warrior a few days back
12:31 ^🔗	joepie91	lot of 0MBs
12:31 ^🔗	joepie91	lol
12:31 ^🔗	BlueMax	memory leak to the max :P
12:32 ^🔗	SmileyG	alard: when should it start new processes :S, I've got it set to 6 but it still only shows 4?
12:33 ^🔗	alard	SmileyG: When an item finishes the warrior checks the number to see how many new items there should be.
12:33 ^🔗	SmileyG	hmmm k
12:33 ^🔗	SmileyG	one's just finished, let's see if it works this time
12:33 ^🔗	SmileyG	Also I've changed to BT but the banner still shows webshots (I presume because some of the jobs are still webshots).
12:34 ^🔗	joepie91	have there ever been archiving/warrior projects where the warriors were throttled/rate-limited/blocked?
12:34 ^🔗	joepie91	SmileyG: it will first finish the webshots jobs
12:34 ^🔗	joepie91	then move on to BT
12:34 ^🔗	SmileyG	I rate limit mine joepie91 :P
12:34 ^🔗	joepie91	oooo, 39MB
12:34 ^🔗	alard	The warrior can't run multiple projects at the same time, so yes, it waits for webshots to complete.
12:35 ^🔗	SmileyG	ok, makes sense :D
12:35 ^🔗	alard	(Also: why not keep it on webshots? I expect btinternet won't take long.)
12:35 ^🔗	BlueMax	it'd be cool if it could multitask
12:35 ^🔗	BlueMax	one process on one project, four on another
12:35 ^🔗	SmileyG	I have a webshots running at work on 5Mbit, this is amazingly slow compared to that ;)
12:42 ^🔗	joepie91	alard: http://www.quickonlinetips.com/archives/2012/09/google-feedburner-shutting-down/
12:43 ^🔗	joepie91	not sure if there's any useful data on feedburner
12:43 ^🔗	joepie91	but sure looks like signs of imminent death
12:43 ^🔗	joepie91	also http://searchenginewatch.com/article/2213759/Google-Shutting-Down-AdSense-for-Feeds-Classic-Plus-More-Services?utm_source=twitterfeed&utm_medium=twitter
12:43 ^🔗	alard	Isn't that just a proxy/cache/stats service?
12:43 ^🔗	Cameron_D	Yeah, it is a stats tracking service for RSS feeds
12:44 ^🔗	Cameron_D	So thousands of RSS feeds will break
12:44 ^🔗	Cameron_D	but they don't really host much data
12:44 ^🔗	joepie91	this may also be a problem for THQ-related sites: http://www.gamearena.com.au/news/read.php/5116588
12:44 ^🔗	joepie91	THQ Asia Pacific shutting down
12:44 ^🔗	godane	i got to grab my t3 magazine podcast then
12:45 ^🔗	joepie91	are there any THQ Asia Pacific-run sites that have user content?
12:45 ^🔗	Cameron_D	looking now
12:46 ^🔗	godane	Cameron_D: links to a lot of podcasts and stuff could be lost
12:47 ^🔗	godane	http://feeds.feedburner.com/T3/podcast
12:48 ^🔗	Cameron_D	feedburner just acts as a proxy though (To collect stats)
12:48 ^🔗	Cameron_D	Somewhere on the t3 site is the actual feed
12:48 ^🔗	Cameron_D	At least that is how I remember it working
12:49 ^🔗	godane	but that feed i think doesn't go back that far
12:50 ^🔗	joepie91	Cameron_D: also as an aggregator afaik
12:50 ^🔗	godane	their only feed is from feedburner
13:07 ^🔗	balrog_	the warrior image has issues
13:08 ^🔗	balrog_	first off, vmware complains that it doesn't meet ova specs
13:08 ^🔗	balrog_	second, I get an error that there's an ide slave with no master
13:08 ^🔗	alard	balrog_: Which image?
13:09 ^🔗	alard	20121008?
13:09 ^🔗	balrog_	archiveteam-warrior-v2-20121008
13:09 ^🔗	balrog_	yes
13:09 ^🔗	Cameron_D	http://dmorton.staff.hostgator.com/archiveteam-warrior-vmware.ova vmware-compatible (albeit an older version)
13:09 ^🔗	balrog_	why did this one break?
13:10 ^🔗	alard	I don't know about the ova specs. There previously was a problem with the filename. I had exported the image as archiveteam-warrior-v2.ova, and then renamed it to include the date. This new image is exported with the correct name.
13:10 ^🔗	alard	And IDE slave with no master, that seems to be a virtualbox - vmware incompatibility.
13:10 ^🔗	balrog_	The import failed because /path/to/archiveteam-warrior-v2-20121008.ova did not pass OVF specification conformance or virtual hardware compliance checks. Click Retry to relax OVF specification and virtual hardware compliance checks and try the import again, or click Cancel to cancel the import. If you retry the import, you might not be able to use the virtual machine in VMware Fusion.
13:11 ^🔗	alard	I've added two disks in VirtualBox, but for some reason VMware ends up with two controllers: 1-master for disk 1, 2-slave for disk 2.
13:11 ^🔗	balrog_	and then ... There is an IDE slave with no master at ide1:1. This configuration does not work correctly in virtual machines. Move the disk/CD-ROM from ide1:1 to ide1:0 using the configuration editor.
13:12 ^🔗	balrog_	I wouldn't be surprised if VBox is malforming the ova
13:12 ^🔗	balrog_	VBox is unfortunately full of bugs
13:13 ^🔗	Cameron_D	heh, ESXi still rejects the file too http://i.imgur.com/z3Kox.png
13:14 ^🔗	balrog_	hm, they have an OVF tool
13:16 ^🔗	S[h]O[r]T	balrog_
13:16 ^🔗	S[h]O[r]T	are you running vmware workstation?
13:16 ^🔗	balrog_	no, fusion
13:16 ^🔗	balrog_	which is basically the mac version of workstation
13:17 ^🔗	S[h]O[r]T	when i first imported archiveteam-warrior-v2-20120813 i got the error about it not being valid. then i just imported again and it worked.
13:17 ^🔗	S[h]O[r]T	i got the ide error as well after that too
13:17 ^🔗	balrog_	yeah but I keep getting the ide error
13:17 ^🔗	S[h]O[r]T	you just have to go into the settings and change the second drive to ide0:1
13:17 ^🔗	S[h]O[r]T	from ide 1:0
13:23 ^🔗	balrog_	hmm
13:23 ^🔗	balrog_	what if someone imported the vm into vmware, fixed it, and exported it?
13:23 ^🔗	balrog_	I wonder if the ova file would be more up-to-spec
13:25 ^🔗	S[h]O[r]T	you'd probably want to export as a vmdk or whatever the vmware equivalent is. you can always just rar up the vmdk files and if someone uses them vmware will just ask if they copied it
13:25 ^🔗	joepie91	alard: btinternet\.(com|co\.uk)
13:25 ^🔗	joepie91	right?
13:25 ^🔗	balrog_	ova is better if it's compatible
13:25 ^🔗	balrog_	err, compliant
13:25 ^🔗	balrog_	apparently vbox doesn't produce compliant files
13:26 ^🔗	joepie91	bingo
13:26 ^🔗	joepie91	http://www.btinternet.com/~se16/hgb/statjoke.htm
13:26 ^🔗	joepie91	se16 :P
13:27 ^🔗	godane	uploaded: http://archive.org/details/cdrom-linuxformatmagazine-76
13:27 ^🔗	alard	joepie91: Yes, and then www\.(.+)\.btinternet or /~([^%?/]+)
13:28 ^🔗	SmileyG	Final webshots rsync finishes in a few min and then bt ':D
13:29 ^🔗	joepie91	alard: I've also seen a few without www in front
13:29 ^🔗	joepie91	and just the username
13:31 ^🔗	joepie91	alard: 7z e -so *.7z | grep -P "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
13:31 ^🔗	joepie91	:)
13:31 ^🔗	joepie91	will take a few hours for the torrent to finish downloading
13:31 ^🔗	joepie91	after that, that will yield all the relevant entries
13:36 ^🔗	joepie91	better:
13:36 ^🔗	joepie91	7z e -so *.7z 2> /dev/null | grep -Po "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
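The username patterns discussed above (a `user.btinternet.com` subdomain, or a `/~user` path) can be sketched in Python as well. This is only an illustration of the idea, not the code used by btinternet-grab; the function name `extract_usernames` and the exact character classes are assumptions.

```python
import re

# Hypothetical pattern, loosely following the grep expressions in the log:
# an optional "<name>." in front of btinternet.com / btinternet.co.uk,
# optionally followed by "/~<name>".
PATTERN = re.compile(
    r"(?:(?:www\.)?([A-Za-z0-9_-]+)\.)?btinternet\.(?:com|co\.uk)(?:/~([^/\s%?]+))?",
    re.IGNORECASE,
)


def extract_usernames(text):
    """Return the set of candidate usernames found in text (lowercased)."""
    names = set()
    for match in PATTERN.finditer(text):
        subdomain, tilde_name = match.group(1), match.group(2)
        if tilde_name:
            # http://www.btinternet.com/~user/...
            names.add(tilde_name.lower())
        elif subdomain and subdomain.lower() != "www":
            # http://www.user.btinternet.co.uk/ or http://user.btinternet.com/
            names.add(subdomain.lower())
    return names
```

Fed one line of a decompressed dump at a time, this would collect the same kind of candidates that the `7z e -so ... | grep -Po ...` pipeline prints.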
13:57 ^🔗	balrog_	how well does warrior handle a network connection change?
14:01 ^🔗	balrog_	how well does warrior handle a network connection change?
14:01 ^🔗	balrog_	also, why no rsync with continue?
14:05 ^🔗	SmileyG	balrog_: it should back off then continue once it figures it out
14:06 ^🔗	balrog_	you mean with the wget?
14:06 ^🔗	balrog_	rsync seems to lack continue though...
14:08 ^🔗	alard	Doesn't --partial-dir enable --partial?
14:08 ^🔗	alard	(Just rsync --partial is dangerous in this case, since SketchCow will move any file in the upload directory.)
14:22 ^🔗	willwill	Hey there, if you see my name on uncompleted webshots job please release the lock.
14:25 ^🔗	alard	willwill: No problem. (There will probably be other failed jobs, so I'll requeue them all at once later.)
14:46 ^🔗	SmileyG	balrog_: rsync, continue?
14:46 ^🔗	SmileyG	rsync knows what its sent and it doesn't require continue
14:46 ^🔗	balrog_	resume rather
14:47 ^🔗	balrog_	--partial or -P switch
14:47 ^🔗	SmileyG	doesn't need it....
14:47 ^🔗	SmileyG	partial does partial files
14:48 ^🔗	SmileyG	rsync checks for each file as it goes
14:48 ^🔗	balrog_	yeah well a single .warc is pretty large
14:48 ^🔗	balrog_	and if it gets interrupted, whole thing has to start over
14:48 ^🔗	SmileyG	yeah true, then you're screwed :S
14:52 ^🔗	alard	I've added --partial to btinternet, so the next project will have it too.
14:52 ^🔗	SmileyG	Isn't that going to cause issues as you highlighted earlier?
14:52 ^🔗	alard	No, because --partial-dir keeps the partial files in a separate directory.
14:53 ^🔗	alard	They're uploaded to the .rsync-tmp/ subdirectory and moved when they're uploaded.
14:54 ^🔗	alard	I thought --partial-dir would be enough, but apparently you need --partial too.
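The combination described above, `--partial` together with `--partial-dir` so that incomplete uploads stay in a separate `.rsync-tmp/` directory, could be assembled roughly like this. It is a sketch only: the helper name `build_rsync_command`, the `-av` flags and the example target are assumptions, not the actual pipeline.py code.

```python
def build_rsync_command(source, target, partial_dir=".rsync-tmp"):
    """Build an rsync argument list that keeps partial files out of the way.

    Hypothetical helper. Both --partial and --partial-dir are passed,
    following the observation in the log that --partial-dir alone did not
    seem to be enough.
    """
    return [
        "rsync",
        "-av",
        "--partial",
        "--partial-dir=" + partial_dir,
        source,
        target,
    ]


# Example (the target is a placeholder, not a real upload host):
# build_rsync_command("item.warc.gz", "rsync://upload.example.invalid/module/")
```

The list can be handed to `subprocess.call` as is; keeping partial files in their own directory matters here because anything in the main upload directory may be moved away at any moment.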
14:55 ^🔗	SmileyG	oooo
14:55 ^🔗	SmileyG	heh thats random devs for you
14:59 ^🔗	joepie91	alard: the title in the btinternet pipeline.py is still webshots
14:59 ^🔗	joepie91	;)
15:02 ^🔗	alard	I see. And apparently the title isn't used anywhere.
15:03 ^🔗	alard	Wikipedia produced 933 new btinternet names.
15:04 ^🔗	joepie91	:D
15:04 ^🔗	joepie91	I'm searching math stackexchange now
15:04 ^🔗	SmileyG	wikipedia? :o
15:04 ^🔗	joepie91	alard: stats stackexchange produced "se16" as only username
15:06 ^🔗	joepie91	it's referenced a lot on math. as well
15:06 ^🔗	joepie91	seems like a pretty important site
15:06 ^🔗	joepie91	ha
15:06 ^🔗	joepie91	Think twice before using BT as an ISP.
15:06 ^🔗	joepie91	on the homepage of that site
15:06 ^🔗	joepie91	BT used to provide its internet subscribers with a small amount of personal webspace, but did not promote the service so only the oldest most loyal customers used it. Now it now longer wishes to satisfy these customers and is closing the service down. So this page and others of mine, which have received over 2 million hits in 13 years, have to move.
15:06 ^🔗	joepie91	If your browser does not automatically go to http://www.se16.info/index.htm within a few seconds, you may want to go to the destination manually.
15:06 ^🔗	joepie91	My conclusion is that if you ever consider BT as a possible ISP for some reason, you should not expect that reason to last.
15:07 ^🔗	SmileyG	yah
15:09 ^🔗	alard	joepie91: We already had it. :) Processed items: 1, added to main queue: 0
15:12 ^🔗	joepie91	alright :P
15:12 ^🔗	joepie91	brb
15:14 ^🔗	DoubleJ	alard: Quick question about the warrior: If there are multiple warcs waiting to upload, how does it decide which one goes next?
15:15 ^🔗	alard	LIFO, I think, but if you really want to know you should check here: https://github.com/ArchiveTeam/seesaw-kit/blob/master/seesaw/task.py#L72-107
15:17 ^🔗	DoubleJ	I... have no idea what I'm looking at.
15:18 ^🔗	DoubleJ	But since it looks like array manipulation, I'm guessing my request to do smallest file first is a no-go.
15:19 ^🔗	alard	That would be hard, I think. Then the queueing thing would have to know about file sizes.
15:19 ^🔗	alard	And does it really matter?
15:19 ^🔗	DoubleJ	Kinda-maybe. It'd free up more threads to download quicker.
15:20 ^🔗	DoubleJ	As it is there are times when all my worker threads are waiting for one upload to finish so they can go.
15:20 ^🔗	DoubleJ	Of course then you'd have a problem with large files never uploading, but you could conceivably have that with LIFO as well and I haven't seen it happen yet.
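The two orderings being compared here, last-in-first-out versus smallest file first, can be illustrated with a small sketch. It is not taken from seesaw-kit; the function names and the `(name, size)` tuples are made up for the example.

```python
import heapq


def lifo_order(items):
    """Newest item first: items is a list of (name, size) in arrival order."""
    return [name for name, _size in reversed(items)]


def smallest_first_order(items):
    """Smallest upload first; ties are broken by arrival order."""
    heap = [(size, index, name) for index, (name, size) in enumerate(items)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _size, _index, name = heapq.heappop(heap)
        ordered.append(name)
    return ordered
```

As noted in the discussion, either ordering can starve some item (old ones under LIFO, large ones under smallest-first) if new work keeps arriving, and smallest-first also requires the queue to know file sizes.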
15:22 ^🔗	alard	Maybe the upload limit should just go.
15:23 ^🔗	alard	Some people wanted it in the previous warrior.
15:23 ^🔗	SmileyG	I limit the VM, shrug.
15:23 ^🔗	DoubleJ	Upload limit, as in throughput, or as in waiting turns?
15:24 ^🔗	alard	Waiting turns. I think the thinking then was that one rsync uploads faster, so can start downloading sooner.
15:24 ^🔗	alard	The opposite of what you say now, basically. :)
15:24 ^🔗	DoubleJ	I can kinda see that, since the overhead for switching wouldn't help overall.
15:24 ^🔗	SmileyG	wasn't it because the upload location was really slow at one point?
15:24 ^🔗	SmileyG	and no one could finish anything :D
15:24 ^🔗	SmileyG	ended up eating all the space on the warriors.
15:25 ^🔗	DoubleJ	Is there someplace I can set it to let 2 upload at once, see if there are any wins to be had that way?
15:26 ^🔗	SmileyG	yup
15:26 ^🔗	SmileyG	you running vm?
15:26 ^🔗	SmileyG	I have up to 6 uploads at once.
15:26 ^🔗	DoubleJ	Yes.
15:26 ^🔗	SmileyG	ok, on the vm window
15:26 ^🔗	SmileyG	alt+F3
15:26 ^🔗	DoubleJ	OK, log in to the VM. Got that.
15:26 ^🔗	SmileyG	nano -w /home/warrior/projects/webshots/pipeline.py
15:27 ^🔗	SmileyG	ctrl+w
15:27 ^🔗	DoubleJ	(Well, I will have that, about 6:00 tonight. can't access the VM from work :) )
15:27 ^🔗	SmileyG	Ah ok
15:27 ^🔗	SmileyG	I need to do a page on this on the wiki
15:27 ^🔗	DoubleJ	But keep going. I'll check the scrollback tonight.
15:29 ^🔗	DoubleJ	alard: Dunno what project it was requested for, but webshots may just be a different critter. Large variation in upload sizes. Waiting is probably still good, we just might want to be smarter about the criteria for deciding who's next :)
15:29 ^🔗	DoubleJ	But the current warrior wins on simplicity.
15:29 ^🔗	alard	Is it worth removing the limit?
15:29 ^🔗	SmileyG	type LimitConcurrent and hit enter, and change the 1 to 6 (or whatever figure)
15:29 ^🔗	DoubleJ	(At least, I think it does. I can read Python about as well as I can read Japanese. (Not at all.))
15:30 ^🔗	DoubleJ	I'll try mine tonight. It may let smaller files squeak out, but it may also take longer because of drive-spinning at either end.
15:32 ^🔗	alard	Word of caution: if you change the pipeline.py in your warrior, you may break future updates. (If git can't figure out how to apply the update to your modified version.)
15:32 ^🔗	SmileyG	heh, i seem to have broken it anyway ¬_¬
15:32 ^🔗	SmileyG	still getting no output
15:33 ^🔗	alard	Stop the project, go into your warrior and use git pull to figure out what's wrong?
15:33 ^🔗	DoubleJ	Understood. But define "break". Update won't apply, warrior will conk out, house burns down, what?
15:33 ^🔗	alard	I think you can expect the SmileyG problem.
15:34 ^🔗	DoubleJ	Ah.
15:34 ^🔗	SmileyG	webserver runs, nothing else does :D
15:34 ^🔗	alard	So you'll have to login, use git pull to figure out what's going wrong.
15:34 ^🔗	DoubleJ	And as we're talking about it my 261-meg user finishes :)
15:35 ^🔗	primus	alard, would it work to just delete project and restart warrior?
15:35 ^🔗	SmileyG	alard: I'd vote for keep the limit, but add option to change it.
15:35 ^🔗	alard	SmileyG: Is that worth stopping every warrior? (That's what happens if I push an update. Every warrior will finish its current task and restart the project.)
15:36 ^🔗	alard	primus: That would work.
15:36 ^🔗	SmileyG	alard: can't you just do the update and let them pull it in time?
15:36 ^🔗	DoubleJ	Yeah, restarting warriors on this project I think is worse.
15:36 ^🔗	alard	Define "in time"?
15:36 ^🔗	SmileyG	when ever they restart their vm?
15:36 ^🔗	alard	No. They check for updates on github.
15:36 ^🔗	SmileyG	Also, add "Check for updates" button to settings page?
15:36 ^🔗	DoubleJ	Heh. Like Windows Update. "Updates to this warrior are now available. Apply? This may require your warrior to restart."
15:36 ^🔗	primus	lol
15:37 ^🔗	SmileyG	where do I run the git pull?
15:37 ^🔗	alard	What we should have, in a future version, is a gradual update.
15:37 ^🔗	alard	cd /home/warrior/projects/$project/
15:37 ^🔗	alard	(perhaps su -u warrior first)
15:38 ^🔗	SmileyG	hmmm its moanin about the changes in pipeline
15:39 ^🔗	*	SmileyG changes it back and git pulls
15:39 ^🔗	DoubleJ	It'd probably be an awful bitch, but would the multiple-project stuff be useful for that? So /home/warrior/projects/$project.$version instead? Let one run out while the new one sees threads disappear and spins up?
15:39 ^🔗	DoubleJ	s/stuff/idea/
15:40 ^🔗	SmileyG	alard: ok I see the new rsync code...
15:40 ^🔗	SmileyG	need to restart the warrior for web interface to update?
15:41 ^🔗	SmileyG	or is it only set via the code (And won't this then cause git to explode again?)
15:41 ^🔗	SmileyG	:O
15:41 ^🔗	SmileyG	ITS GONE CRAZY
15:41 ^🔗	SmileyG	15 users and counting on one screen
15:43 ^🔗	SmileyG	There we go...
15:43 ^🔗	SmileyG	that is bonkers when it first starts up
15:43 ^🔗	SmileyG	you just see hundreds of boxes popping up
15:44 ^🔗	SmileyG	alard: I remember - The script to create the 50Gb tars couldn't keep up for fortuneCity, thats why the rsync got limited.
15:54 ^🔗	alard	DoubleJ: Yes, that's similar. (I was thinking it might be better to have the cloned git repo in /home/warrior/projects/$project, as the most up-to-date version, then do a clone to /data/projects/$project.$version before starting a project.)
16:37 ^🔗	alard	Have we killed fos?
16:38 ^🔗	SmileyG	:O
16:39 ^🔗	SmileyG	2Kb/s! \o/
16:39 ^🔗	SmileyG	Oh its coming back now
16:40 ^🔗	SmileyG	Planned Delivery Date
16:40 ^🔗	SmileyG	Wednesday 10th October
16:40 ^🔗	SmileyG	Planned Delivery Time
16:40 ^🔗	SmileyG	Between 07:30 and 17:30
16:40 ^🔗	SmileyG	Wed Oct 10 17:40:33 BST 2012
16:40 ^🔗	SmileyG	HERP?
17:08 ^🔗	joepie91	HEY
17:08 ^🔗	SmileyG	yeah the uploads are totally dead?
17:08 ^🔗	joepie91	primus
17:08 ^🔗	joepie91	:(
17:08 ^🔗	joepie91	you've overtaken me
17:08 ^🔗	joepie91	SmileyG: ?
17:08 ^🔗	SmileyG	4587520 39% 12.21kB/s 0:09:45
17:08 ^🔗	SmileyG	[sender] io timeout after 300 seconds -- exiting
17:09 ^🔗	joepie91	sec
17:09 ^🔗	joepie91	wtf, mine is dead
17:09 ^🔗	SmileyG	Retrying RsyncUpload for Item jpr.tree after 30 seconds...
17:13 ^🔗	SmileyG	.... brokeyd :D
17:13 ^🔗	SmileyG	alard: did you break something :(
17:21 ^🔗	joepie91	my rsyncs are dying..
17:21 ^🔗	joepie91	rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
17:21 ^🔗	joepie91	Process RsyncUpload returned exit code 10 for Item andrewjjstanley
17:21 ^🔗	joepie91	Retrying RsyncUpload for Item andrewjjstanley after 30 seconds...
17:21 ^🔗	joepie91	rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
17:22 ^🔗	SmileyG	yah
17:22 ^🔗	SmileyG	:<
17:23 ^🔗	SmileyG	they retry, but still its killed all progress :<
17:23 ^🔗	joepie91	oh
17:23 ^🔗	joepie91	they run now
17:24 ^🔗	alard	http://isup.me/fos.textfiles.com
17:26 ^🔗	alard	I think this is a SketchCow problem.
17:27 ^🔗	SmileyG	:<
17:27 ^🔗	alard	(The warriors will retry 50 times with 30 second pauses before they fail.)
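The retry behaviour mentioned above (up to 50 attempts with 30-second pauses) follows a common pattern that can be sketched as below. This is not the seesaw implementation; `run_with_retries` and its parameters are hypothetical names chosen for the example.

```python
import time


def run_with_retries(task, max_tries=50, delay_seconds=30, sleep=time.sleep):
    """Call task() until it succeeds or max_tries attempts have failed.

    The sleep function is injectable so the pauses can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return task()
        except Exception as error:  # broad on purpose: any failure is retried
            last_error = error
            if attempt < max_tries:
                sleep(delay_seconds)
    raise RuntimeError("gave up after %d attempts" % max_tries) from last_error
```

With the defaults, an upload target would have to stay unreachable for roughly 25 minutes of pauses (49 pauses of 30 seconds) before the item is given up on.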
17:28 ^🔗	SmileyG	:< herp.
17:34 ^🔗	joepie91	alard: it responds to ping
17:46 ^🔗	SmileyG	alard: se16 0MB << hey look :D
18:21 ^🔗	joepie91	SmileyG: mmm
18:21 ^🔗	joepie91	it's probably because he replaced the index page
18:22 ^🔗	SmileyG	joepie91: yeah I figured it might be that.
18:22 ^🔗	SmileyG	well it makes sense, the script forwards you off site.
18:41 ^🔗	underscor	fos is currently down-ish
18:41 ^🔗	underscor	fyi
18:41 ^🔗	chronomex	ish
18:41 ^🔗	chronomex	how can a box be down-ish
18:42 ^🔗	SketchCow	He's mincing words.
18:42 ^🔗	underscor	it still pings
18:42 ^🔗	SketchCow	It's down.
18:42 ^🔗	SketchCow	It's superdown.
18:42 ^🔗	underscor	VMs at archive have 3 states. Up, nossh/services, and noping
18:43 ^🔗	underscor	anyway, yeah, it's turbofucked
18:46 ^🔗	Nemo_bis	how does tpb fetch Google Books' stuff? does it accept suggestions? http://lists.wikimedia.org/pipermail/wikisource-l/2012-October/001204.html
18:49 ^🔗	underscor	wait
18:49 ^🔗	underscor	how is rsync still working if fos is down :O
19:13 ^🔗	SketchCow	OKAY HI
19:13 ^🔗	SketchCow	NEED HELP
19:14 ^🔗	SketchCow	https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
19:14 ^🔗	SketchCow	OK, that's a listing of all archiveteam projects on archive.org.
19:14 ^🔗	SketchCow	1. Please see if I missed any.
19:15 ^🔗	SketchCow	(i.e. just browse through the archiveteam set to see)
19:28 ^🔗	underscor	haha, I love the item counts
19:28 ^🔗	underscor	26, 70, 29, 3956
19:35 ^🔗	chronomex	is IA down? not working for me.
19:39 ^🔗	godane	its not working for me too
19:39 ^🔗	chronomex	k
19:42 ^🔗	SmileyG	SketchCow: you missed the most famous of all - geocities.
19:45 ^🔗	joepie91	heh
19:45 ^🔗	joepie91	okay, maybe a recursive grep through my entire repository folder was a bad idea
19:46 ^🔗	alard	Geocities isn't warc.
19:46 ^🔗	underscor	IA is fucked right now
19:46 ^🔗	underscor	please leave a message after the beep
19:46 ^🔗	underscor	:D
19:46 ^🔗	*	chronomex waits for the beep
19:46 ^🔗	underscor	boop
19:46 ^🔗	*	SmileyG hears helicopters
19:47 ^🔗	underscor	But yeah, it's down. One of the core boxes decided to take a dump all over everything, people are working on fixing now
19:47 ^🔗	chronomex	ok, I'm not in a hurry
19:47 ^🔗	joepie91	underscor: wat
19:47 ^🔗	joepie91	IA went down?
19:48 ^🔗	underscor	it's down right now
19:48 ^🔗	SmileyG	we broke it ¬_¬
19:48 ^🔗	underscor	lol
19:49 ^🔗	joepie91	oh wow
19:50 ^🔗	alard	Can't edit the list, but Cinch is missing. City of Heroes (two items, I think: boards and www).
19:52 ^🔗	alard	Qaudio.
20:04 ^🔗	joepie91	god I hate efnet
20:05 ^🔗	joepie91	anyway
20:05 ^🔗	joepie91	is anyone up for testing a useful script?
20:05 ^🔗	joepie91	wrote a script that takes a glob pattern, then tries to figure out (from extension) what kind of archive each file is, and prints the decompressed contents to stdout using the appropriate application, without actually unpacking it
20:05 ^🔗	joepie91	consider it a 'cat' for archives :)
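A rough idea of what such a "cat for archives" could look like is sketched below. Note the differences from the script described above: this version uses Python's standard library instead of invoking external applications, and it only handles single-stream compressed files (.gz, .bz2, .xz), not multi-file archives such as .7z. The name `archive_cat` is an assumption.

```python
import bz2
import glob
import gzip
import lzma
import os

# Map file extensions to stdlib openers (assumed set; extend as needed).
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def archive_cat(pattern, out):
    """Write the decompressed contents of every file matching pattern to out.

    out must be a binary stream, e.g. sys.stdout.buffer. Files with an
    unknown extension are skipped. Nothing is unpacked to disk.
    """
    for path in sorted(glob.glob(pattern)):
        extension = os.path.splitext(path)[1].lower()
        opener = OPENERS.get(extension)
        if opener is None:
            continue
        with opener(path, "rb") as handle:
            while True:
                chunk = handle.read(65536)
                if not chunk:
                    break
                out.write(chunk)
```

Called as `archive_cat("dumps/*.bz2", sys.stdout.buffer)` (after `import sys`), the output can be piped into grep in the same way as the `7z e -so` command earlier in the log.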
20:14 ^🔗	SmileyG	so like zcat?
20:15 ^🔗	chronomex	igelritte: you know you can be in multiple channels at once, right?
20:15 ^🔗	underscor	igelritte: yeah, most of us are in both
20:16 ^🔗	chronomex	well, actually, I don't know how to do it with pidgin
20:16 ^🔗	chronomex	but I think you can
20:16 ^🔗	underscor	just /j #channel1 and /j #channel2
20:16 ^🔗	underscor	they open up as tabs
20:16 ^🔗	underscor	at least in my pidgin
20:17 ^🔗	igelritte	yeah, I didn't think about it
20:17 ^🔗	igelritte	whateve's. I'm here now
20:18 ^🔗	chronomex	k
20:19 ^🔗	igelritte	so, tell me more about your structure and how one can plug in.
20:21 ^🔗	igelritte	Is it some starry-eyed-open-source-free-for-all? Or is there a process wherein you tell a gatekeeper what you can do, what you're experienced with, and then they tell you where you can start helping?
20:22 ^🔗	chronomex	freeforall.
20:23 ^🔗	igelritte	I've seen Mr. Scott's presentation at Defcon on how AT is going to save your shit...which sounds good to me...but that doesn't tell me a lot about how the group is organized.
20:23 ^🔗	SmileyG	some people write code
20:23 ^🔗	SmileyG	I appear and make comments
20:23 ^🔗	SmileyG	most people run some sort of downloaders
20:23 ^🔗	SmileyG	godane is ..... well I don't know :D
20:24 ^🔗	mistym	There are often projects you can help in by running code written by others, basically volunteering your bandwidth to help out.
20:24 ^🔗	chronomex	godane is affiliated but mostly works on solo projects
20:24 ^🔗	mistym	Those are usually advertised on the wiki and IRC, plus I think there's a mailing list for it now too.
20:24 ^🔗	igelritte	Unfortunately, I'm not really in a good position at the moment to run downloaders or anything else that requires a 24 hour network connection.
20:24 ^🔗	SmileyG	If you haven't got bandwidth, then you can help with the wiki and possibly coding...
20:25 ^🔗	SmileyG	doesn't need 24hr, it'll work when you can
20:25 ^🔗	SmileyG	up to a point
20:25 ^🔗	DFJustin	joepie91: that already exists as lsar in The Unarchiver, although it's all built-in and not invoking other apps
20:25 ^🔗	igelritte	I'm following this silly dream about living in Germany which means that my current address is--shall we say--fluid.
20:25 ^🔗	DFJustin	oh wait I'm wrong nm
20:26 ^🔗	DFJustin	keep forgetting unix cat is not the same as apple II cat :)
20:26 ^🔗	igelritte	Are most people in North America?
20:26 ^🔗	chronomex	a good number but by no means all
20:26 ^🔗	SmileyG	i'm UK
20:27 ^🔗	igelritte	I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
20:27 ^🔗	DFJustin	igelritte: jason is in the gatekeeper role more or less, or cat herder if you prefer
20:27 ^🔗	chronomex	in order probably US, UK, AU, .eu
20:27 ^🔗	igelritte	Jason seems to do a lot.
20:29 ^🔗	DFJustin	but there's a lot of empowerment if you see something to just do it yourself
20:29 ^🔗	igelritte	Well, I can definitely help with the wiki
20:30 ^🔗	igelritte	when you say, 'coding', what do you mean?
20:30 ^🔗	soultcer	Programming stuff that downloads stuff
20:30 ^🔗	igelritte	I have a fair amount of experience with BASH scripting
20:31 ^🔗	igelritte	what are you guys using to download stuff?
20:31 ^🔗	DFJustin	perfect
20:31 ^🔗	DFJustin	primarily wget
20:31 ^🔗	igelritte	oh, hold on there soldier, my BASH scripting is far from perfect
20:31 ^🔗	joepie91	DFJustin: The Unarchiver sounds like a comic hero :P
20:31 ^🔗	DFJustin	it's like a real life superhero
20:31 ^🔗	igelritte	but I have written some stuff using wget to batch download stuff for myself
20:32 ^🔗	DFJustin	the main difference is we use a parameter to wget to have it produce .warc files which are a full record of HTTP headers etc. suitable for going into the wayback machine
20:32 ^🔗	igelritte	lectures from the opencourse ware project at MIT
20:32 ^🔗	igelritte	hmmm
20:33 ^🔗	alard	Yes, so if you download anything for archiving, use the --warc-file option (available in Wget 1.14).
20:34 ^🔗	igelritte	hmmm. It appears that the wget that comes with Ubuntu these days is 1.13
20:34 ^🔗	igelritte	at least, so says dpkg
20:35 ^🔗	mistym	You'll need to build it yourself then (or grab a newer package). .warc support wasn't added until 1.14.
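Since `--warc-file` needs Wget 1.14 or newer, a script can check the installed version before relying on it. The sketch below only parses the text that `wget --version` prints (its first line starts with `GNU Wget` followed by the version number); the function name `wget_supports_warc` is an assumption.

```python
import re


def wget_supports_warc(version_output):
    """Return True if the given `wget --version` output reports >= 1.14."""
    match = re.search(r"GNU Wget (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    version = (int(match.group(1)), int(match.group(2)))
    return version >= (1, 14)
```

The text could be obtained with `subprocess.check_output(["wget", "--version"])` and decoded before being passed in.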
20:35 ^🔗	DFJustin	for our big multi-user projects we supply a ready-made VM with everything all set up and just a go button to push
20:35 ^🔗	igelritte	okay
20:35 ^🔗	igelritte	um, what are warc files and why use them?
20:36 ^🔗	DFJustin	warc is a standardized format for web archives, it includes all the HTTP response data from the server (not just the file contents) so that you can "play it back" with a proxy and duplicate the original site exactly
20:36 ^🔗	igelritte	You'all are interested in full HTTP headers, or the way back machine?
20:36 ^🔗	igelritte	interesting
20:37 ^🔗	igelritte	very interesting
20:37 ^🔗	DFJustin	the main impetus is that it's a requirement for wayback to integrate the data (proper timestamps are a necessity, for example)
20:37 ^🔗	igelritte	Okay, I can see what you're saying
20:38 ^🔗	DFJustin	everyone grabbed geocities kind of higgledy-piggledy and it's hard to pin down the dates for anything because of filesystems, time zones, modification time vs download time etc
20:39 ^🔗	DFJustin	so the later projects have been standardized on warc
20:39 ^🔗	igelritte	The Geocities project was quite an accomplishment
20:41 ^🔗	DFJustin	warc is big with the pointy-headed academic world because of formal documentation etc. so that gives us an in with that crowd too
20:41 ^🔗	DFJustin	unfortunately the end user tools for it are not great yet
20:43 ^🔗	igelritte	I loved Jason's picture of the datacenter where the nine terabytes were housed. It reminded me of this scene from 'Connections'--that interesting spin on discovery and invention that came out in the 70's by James Burke--where he holds up an old tape cartridge and expounds: "this device holds one million characters," in that tone of voice like the audience is supposed to piss themselves in amazement. You then do the math and realize that
20:43 ^🔗	joepie91	DFJustin: is there a format specification for warc?
20:43 ^🔗	joepie91	one that is publicly accessible
20:44 ^🔗	DFJustin	ISO 28500
20:45 ^🔗	joepie91	CHF 122,00
20:45 ^🔗	joepie91	eh.
20:46 ^🔗	joepie91	DFJustin; anything or any place that doesn't want to see the inside of my wallet?
20:46 ^🔗	joepie91	:|
20:46 ^🔗	DFJustin	obviously, you can google it just as well as I can though
20:46 ^🔗	joepie91	yes, and I only get drafts
20:47 ^🔗	joepie91	do I seriously have to pirate a document to figure out what warc looks like
20:47 ^🔗	joepie91	:|
20:47 ^🔗	igelritte	I have to say that you folks seem downright Edwardian in your manners. Most of my experiences in chatrooms with tech-savvy folks have not been so pleasant.
20:48 ^🔗	SmileyG	:D
20:48 ^🔗	SmileyG	Most people suck.
20:48 ^🔗	SmileyG	I think the fact everyone is here because they care about it helps, rather than being here because of "work" or other reasons.
20:49 ^🔗	DFJustin	my suspicion is that the 0.18 draft is the same as the final because international standards move slow but I'll defer that to someone whose head is pointier :)
20:49 ^🔗	igelritte	I was working on Linux from Scratch a few years back; their IRC...well, let's just say that you need a thick skin.
20:49 ^🔗	alard	I believe the bib-something site has a PDF of a draft of the warc spec.
20:49 ^🔗	alard	The warc people at archive.org assured me that that's what they use.
20:49 ^🔗	igelritte	And none of those people were there for work...
20:49 ^🔗	DFJustin	http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
20:50 ^🔗	SmileyG	ah yeah hmm
20:50 ^🔗	alard	That's it. Just change the version header WARC/0.18 with WARC/1.0, or something.
20:50 ^🔗	SmileyG	igelritte: I've been "both" sides of the argument
20:50 ^🔗	alard	There's also a warc implementation guidelines somewhere.
20:51 ^🔗	joepie91	alard: the draft is representative?
20:51 ^🔗	*	joepie91 really hates 'standards' that you can't just view
20:52 ^🔗	alard	Yes, I believe so. The Heritrix implementation is based on the same draft, so that's something.
20:52 ^🔗	igelritte	Tell me about it joepie91. I worked in Teleco for years. Any idea what they want for a membership to the ITU?
20:52 ^🔗	alard	http://netpreserve.org/publications/WARC_Guidelines_v1.pdf
20:52 ^🔗	joepie91	igelritte: not sure I even want to know the amount of digits
20:53 ^🔗	igelritte	It's pretty gross
20:53 ^🔗	joepie91	alard: that 404s
20:53 ^🔗	joepie91	anyhow, I'll use the bibnum one then
20:54 ^🔗	alard	Does it? I just copied the link I put on the wiki months ago. :)
20:54 ^🔗	SmileyG	http://archiveteam.org/index.php?title=BT_Internet C-, needs work
20:54 ^🔗	SmileyG	:D
20:54 ^🔗	alard	http://www.netpreserve.org/resources/warc-implementation-guidelines-v1
20:54 ^🔗	alard	http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
20:55 ^🔗	joepie91	thankies
20:55 ^🔗	alard	(It's pretty silly that an "internet preservation consortium" doesn't have stable urls.)
20:55 ^🔗	DFJustin	one of the nice things about WARC though is it's basically human readable, you open it up and bam headers
20:55 ^🔗	DFJustin	so it's reasonably future-proof
20:57 ^🔗	joepie91	lol alard
21:00 ^🔗	SmileyG	Can't upload images to wiki?
21:00 ^🔗	igelritte	When you watch Jason's presentation at Defcon, you know that other people are involved and that recruits are needed, but the specifics are still a little vague. I guess that I've spent so much time interacting with organizations by being told what to do that the free-for-all comes off as very chaotic. Still not very sure where I can plug in.
21:00 ^🔗	SmileyG	why didn't I see "upload file" ? XD
21:00 ^🔗	joepie91	hmm, interesting... http://www.webarchivingbucket.com/
21:00 ^🔗	joepie91	igelritte: link to presentation?
21:01 ^🔗	igelritte	sure
21:02 ^🔗	DFJustin	well our formal projects now are all "run the warrior VM" where we tell your computer exactly what to do
21:02 ^🔗	joepie91	www.btinternet.com/~catechnology
21:02 ^🔗	joepie91	www.btinternet.com/~ted.power
21:02 ^🔗	joepie91	www.dgsgardening.btinternet.co.uk
21:02 ^🔗	joepie91	www.mstracey.btinternet.co.uk
21:02 ^🔗	joepie91	cc alard
21:02 ^🔗	DFJustin	it's just that on top of that people have their own archiving side projects that are related to the mission in varying degrees
21:02 ^🔗	alard	joepie91: http://tracker.archiveteam.org/webshots/rescue-me
21:03 ^🔗	joepie91	alard: webshots?
21:03 ^🔗	joepie91	shouldn't that be btinternet?
21:03 ^🔗	alard	Oops, sorry, http://tracker.archiveteam.org/btinternet/rescue-me
21:03 ^🔗	joepie91	:P
21:03 ^🔗	DFJustin	is that expecting urls or user names
21:03 ^🔗	alard	usernames
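Since the rescue-me form expects usernames rather than URLs, pasted links have to be reduced to usernames first. The following is a minimal sketch of that step, not the regex from the paste above: it assumes the only two URL shapes that matter are the ones quoted in this log (`www.btinternet.com/~NAME` and `www.NAME.btinternet.co.uk`), and the function name `extract_usernames` is made up for illustration.

```python
import re

# Assumed URL shapes, taken from the examples pasted in the channel:
#   www.btinternet.com/~USERNAME
#   www.USERNAME.btinternet.co.uk
PATH_STYLE = re.compile(r"btinternet\.com/~([A-Za-z0-9._-]+)", re.IGNORECASE)
HOST_STYLE = re.compile(
    r"(?:www\.)?([A-Za-z0-9._-]+)\.btinternet\.co\.uk", re.IGNORECASE
)


def extract_usernames(text):
    """Return a sorted, de-duplicated list of usernames found in text."""
    found = set()
    for pattern in (PATH_STYLE, HOST_STYLE):
        for match in pattern.finditer(text):
            # Drop a trailing full stop picked up from surrounding prose.
            name = match.group(1).rstrip(".").lower()
            # A bare "www.btinternet.co.uk" would otherwise yield "www".
            if name and name != "www":
                found.add(name)
    return sorted(found)


links = """www.btinternet.com/~catechnology
www.btinternet.com/~ted.power
www.dgsgardening.btinternet.co.uk
www.mstracey.btinternet.co.uk"""

usernames = extract_usernames(links)
print("\n".join(usernames))
```

Whether the tracker wants names lower-cased, or with dots kept as in `ted.power`, is an assumption here and would need checking against the project's own list.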
21:04 ^🔗	joepie91	0 items added to the queue
21:04 ^🔗	joepie91	Thanks for your help!
21:04 ^🔗	joepie91	lol
21:04 ^🔗	alard	Heh.
21:05 ^🔗	alard	The tracker really appreciates your contribution, it just wasn't useful. :)
21:05 ^🔗	joepie91	haha
21:06 ^🔗	joepie91	looks like catarc works well :)
21:06 ^🔗	joepie91	http://sebsauvage.net/paste/?9e695a09848493ea#Yy3GjmiyMI4bfhUcKv9vahutcX48KTJBHLivJh8l2BU=
21:06 ^🔗	DFJustin	nice regex
21:07 ^🔗	underscor	<igelritte> I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
21:07 ^🔗	underscor	I can't remember
21:07 ^🔗	underscor	hahahahahaha
21:07 ^🔗	underscor	was that bluemax?
21:08 ^🔗	SmileyG	what happened with that o_O
21:08 ^🔗	underscor	joepie91: we conform to the draft fyi
21:09 ^🔗	SmileyG	http://archiveteam.org/index.php?title=BT_Internet <<< wtf is iwth the no description below the imae
21:09 ^🔗	joepie91	ok, thanks :P
21:09 ^🔗	SmileyG	image
21:09 ^🔗	underscor	we being archive.org
21:09 ^🔗	underscor	SmileyG: lulu poetry's IT department sent a scary letter to him
21:09 ^🔗	underscor	"scary" "letter"
21:10 ^🔗	SmileyG	o
21:11 ^🔗	joepie91	igelritte: does a video of the defcon presentation exist?
21:11 ^🔗	joepie91	I can't find it
21:14 ^🔗	alard	SmileyG: The "No description" comes from the image, I think.
21:14 ^🔗	SmileyG	except it has a description :/
21:16 ^🔗	Dark-Star	problems with the archive? I'm getting "rsync: failed to connect to fos.textfiles.com: Connection timed out (110)" all the time
21:16 ^🔗	alard	SmileyG: Oh. Then maybe it's in the template? http://archiveteam.org/index.php?title=Template:Infobox_project&action=edit
21:18 ^🔗	underscor	Dark-Star: it's down atrm
21:18 ^🔗	underscor	atm*
21:19 ^🔗	Dark-Star	ah okay. I'll just leave the Warrior running overnight then. I guess it'll automatically resume the upload later
21:23 ^🔗	SmileyG	alard: ah yeah hmmm :S
21:24 ^🔗	SmileyG	weird because the mobile me one doesn't do it
21:26 ^🔗	igelritte	right on...I'm not as stupid as I originally suspected
21:26 ^🔗	igelritte	GNU Wget 1.14 built on linux-gnu.
21:27 ^🔗	igelritte	I now have the ability to support warc
21:27 ^🔗	igelritte	though, my dpkg still thinks that I'm working with 1.13
21:28 ^🔗	igelritte	It's probably been six months or more since I've compiled and installed anything from scratch. It's funny how quickly you forget that shit.
21:28 ^🔗	alard	igelritte: I don't want to temper your enthusiasm and sense of achievement, but you might want to check if your new Wget includes gzip and SSL support. It's in wget -V, I think.
21:30 ^🔗	igelritte	well, I'm pretty sure that it does because I kept getting an SSL error and had to dig into why and then install libcurl and libgnutls dev packages in order to get wget to compile correctly
21:30 ^🔗	igelritte	but I will check
21:30 ^🔗	alard	Ah good, then it'll probably work.
21:30 ^🔗	alard	soultcer: Starting TinyBack for Item
21:31 ^🔗	alard	(Hint: the git clone is very slow if there's no .git in the repository url: https://github.com/soult/tinyback.git )
21:32 ^🔗	soultcer	It is? Damn, I always felt so clever because I had to type 4 characters less
21:32 ^🔗	igelritte	well, right under the version number, you get the following list: +digest +https +ipv6 +iri +large-file +nls -ntlm +opie +ssl/gnutls
21:32 ^🔗	soultcer	http://tracker.tinyarchive.org/v1/ <-- "ranking"
21:33 ^🔗	alard	soultcer: It's strange, because it does seem to work, but it just takes a long time. I was wondering what my warrior was doing.
21:33 ^🔗	igelritte	I'm not sure about the 'wget-V, I' syntax...is that supposed to be 'wget -V -I'?
21:33 ^🔗	igelritte	or really a comma
21:33 ^🔗	alard	Heh. The comma and I are part of the sentence. :)
21:33 ^🔗	*	igelritte laughs at self
21:34 ^🔗	primus	igelritte: if you're interested in downloading you can download ArchiveTeam Warrior virtual machine - it has everything already set up. http://archive.org/details/archiveteam-warrior
21:35 ^🔗	alard	To check if you have gzip support, use: wget --help | grep warc-compression and see if it returns something. If it does, it works.
21:35 ^🔗	igelritte	I'm a little limited on what I can do with downloading at the moment. This network connection is not really my own.
21:36 ^🔗	DFJustin	<joepie91> igelritte: does a video of the defcon presentation exist? <-- https://www.youtube.com/watch?v=-2ZTmuX3cog
21:38 ^🔗	igelritte	alard: I get the "no-warc-compression"; I'm guessing that warc uses gzip for compression
21:38 ^🔗	igelritte	?
21:40 ^🔗	alard	Then your Wget is in top condition. The thing with gzip is: you can make .warc and .warc.gz files. It is much better to do the gzip compression in Wget than to do it afterwards. Wget makes a new gzip record for each downloaded file, so it's possible to extract only part of the .warc.gz. If you use the gzip utility to compress your warc afterwards, you can only decompress everything at once.
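The difference alard describes comes from gzip allowing several independently compressed "members" in one file. A small sketch of the idea, using made-up record contents rather than real WARC data: each record is compressed on its own, the offset where it starts is remembered, and a single record can later be decompressed without touching the rest.

```python
import gzip
import io
import zlib

# Stand-ins for WARC records; real records would be headers plus payload.
records = [b"record one\n", b"record two\n", b"record three\n"]


def compress_per_record(items):
    """Compress each item as its own gzip member; return (blob, offsets)."""
    out = io.BytesIO()
    offsets = []
    for item in items:
        offsets.append(out.tell())      # where this member starts
        out.write(gzip.compress(item))  # one complete gzip member per record
    return out.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress only the gzip member that starts at offset."""
    # wbits=31 tells zlib to expect a gzip container; the decompressor
    # stops at the end of the first member and ignores what follows.
    decompressor = zlib.decompressobj(wbits=31)
    return decompressor.decompress(blob[offset:])


blob, offsets = compress_per_record(records)
second = read_record_at(blob, offsets[1])
everything = gzip.decompress(blob)  # reading start to finish still works
```

Compressing the whole file in one go afterwards would produce a single member, so there would be no offsets to jump to and the only option would be to decompress from the beginning.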
21:43 ^🔗	igelritte	Just performed a quick little test where I ran the following: wget --warc-file test http://en.wikipedia.org/wiki/Jason_Scott_Sadofsky. This seems to have created the 'test' file that I asked for.
21:43 ^🔗	igelritte	-rw-rw-r-- 1 23386 Oct 10 23:41 test.warc.gz
21:44 ^🔗	joepie91	quick question to alard: how does one write a setup.py where the resulting install package will copy a python file to the bin directory?
21:44 ^🔗	joepie91	/usr/bin etc
21:44 ^🔗	alard	gunzip -c test.warc.gz to look inside
21:45 ^🔗	alard	Why do you think I would know? I'm a copy-paste setup.py writer. :)
21:45 ^🔗	alard	scripts, I think: https://github.com/ArchiveTeam/seesaw-kit/blob/master/setup.py#L41-44
21:46 ^🔗	joepie91	well, seesaw does it :P
21:46 ^🔗	joepie91	and alright, thanks
21:46 ^🔗	alard	I thought you were the python distribution / pip / pypi expert. :)
21:47 ^🔗	igelritte	very interesting. That seems to have worked. I DO have an HTTP document. It doesn't look anything like a wiki, but I'm guessing I know why that is.
21:48 ^🔗	joepie91	alard: oh, not at all
21:49 ^🔗	joepie91	I just know how to package up a module with an existing setup.py
21:49 ^🔗	joepie91	:P
21:49 ^🔗	joepie91	and that's it
21:57 ^🔗	igelritte	so, when I unpack this archive file (warc) I should expect to find nothing but pure HTTP?
21:57 ^🔗	alard	You'll find warc records, some of which have a HTTP body.
21:58 ^🔗	igelritte	hmmm
21:58 ^🔗	alard	You get some warc headers identifying the record (type, target-uri, timestamp etc.), then the http request or response.
21:58 ^🔗	alard	There are special types of warc records with metadata, such as the wget command line and log.
21:59 ^🔗	alard	So it's not the most user-friendly format, you need to work to get the data out.
21:59 ^🔗	alard	The good thing is that everything is in the file, so you can get it out.
22:00 ^🔗	igelritte	This is all just for my education; so, feel free to tell me to fuck off when you lose patience. But, where can I find these headers? When I open the file with a text editor, it appears to be just HTML.
22:01 ^🔗	alard	You'll have to look better then, they're in there.
22:01 ^🔗	alard	It starts with WARC/1.0 or something, then there's WARC-Target-URI, etc.
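To make the layout concrete, here is a rough sketch of splitting one uncompressed WARC record into its version line, its named headers and its content block. The sample record is hand-written for illustration (the URL, date and record ID are invented, and real Wget output carries more headers), and `parse_warc_record` is a hypothetical helper, not part of any existing tool.

```python
def parse_warc_record(data):
    """Split one WARC record into (version, headers, block)."""
    # The WARC header section ends at the first blank line.
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the block that follows.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Invented example: an HTTP response wrapped in a WARC "response" record.
http_block = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>hi</html>"
)
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2012-10-10T21:41:00Z\r\n"
    b"Content-Type: application/http;msgtype=response\r\n"
    b"Content-Length: " + str(len(http_block)).encode("ascii") + b"\r\n"
    b"\r\n" + http_block + b"\r\n\r\n"
)

version, headers, body = parse_warc_record(sample)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

The HTML that shows up in a text editor is the tail end of `body`; the WARC headers and the HTTP headers sit directly in front of it.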
22:04 ^🔗	SketchCow	Hey, so my commentary before.
22:06 ^🔗	alard	It has scrolled away. :)
22:06 ^🔗	DFJustin	SketchCow: http://archive.org/details/archiveteam-city-of-heroes-www is not on the list
22:07 ^🔗	igelritte	craziness...I just used vi on the test.warc.gz file and the headers you mentioned showed up. Vi also showed me all the compressed content. I didn't know that vi could do that...
22:07 ^🔗	SmileyG	SketchCow: geocities - there's a dump on the IA but I can't find it anymore (and it was searchable.... we really need to make those links more accessible...)
22:08 ^🔗	alard	http://archive.org/details/archiveteam-qaudio-rescue
22:08 ^🔗	alard	http://archive.org/details/archiveteam-cinch
22:08 ^🔗	joepie91	wait wait wait wait, what? Jeroenz0r is/was part of urlteam?
22:09 ^🔗	SketchCow	Only WARC items. So Geocities precedes that.
22:10 ^🔗	SmileyG	ah k
22:12 ^🔗	igelritte	Perhaps I'm really thick here...and that wouldn't be a surprise...but I'm still not seeing how I can contribute. Is there a list of "shit that needs to get done and we'd be thrilled if you'd take it on" somewhere?
22:12 ^🔗	SketchCow	Both added, alard
22:12 ^🔗	SketchCow	What's your skillset, igelritte?
22:12 ^🔗	DFJustin	various godane grabs(tm) at https://archive.org/search.php?query=warc%20uploader%3A%22slaxemulator%40gmail.com%22
22:13 ^🔗	alard	There are some groklaw.net warcs: http://archive.org/details/groklaw.net-pdfs-2004-20120827
22:13 ^🔗	igelritte	Well, I've done some BASH scripts. I'm trilingual. I've done lots of networking.
22:13 ^🔗	igelritte	And there's a bunch of voip in there too
22:13 ^🔗	alard	http://archive.org/search.php?query=groklaw%20warc
22:14 ^🔗	joepie91	igelritte: is there any chance you can turn the install script for the webshots script, into something more sane?
22:14 ^🔗	joepie91	because I suck at bash :P
22:14 ^🔗	igelritte	I'm not that awesome at it either, but I can look at it.
22:14 ^🔗	joepie91	current script is at http://cryto.net/projects/webshots/webshots_debian.sh
22:14 ^🔗	alard	http://archive.org/search.php?query=warc%20journalstar (but it's getting more obscure now)
22:14 ^🔗	joepie91	thanks :)
22:16 ^🔗	igelritte	Hmmm...
22:16 ^🔗	nintendud	joepie91: you can set a trap on error to avoid all the conditionals
22:16 ^🔗	igelritte	this could use some commenting and perhaps a header
22:16 ^🔗	nintendud	and then have it print "Error on line x". Not as nice of a message though.
22:17 ^🔗	igelritte	who wrote this? And why are they doing an apt-get at the beginning?
22:17 ^🔗	joepie91	igelritte: I did
22:17 ^🔗	joepie91	and the apt-get is to install dependencies
22:18 ^🔗	alard	http://archive.org/search.php?query=uploader%3A%28slaxemulator%40gmail.com%29%20AND%20warc
22:18 ^🔗	DFJustin	is there an echo in here
22:18 ^🔗	igelritte	I think I see what you're doing here, and I understand why you would do an apt-get update before doing an install
22:18 ^🔗	alard	DFJustin: Oh, sorry. :)
22:18 ^🔗	igelritte	but, I don't think I understand enough of the purpose here to understand why you would do that in a script
22:19 ^🔗	joepie91	igelritte: it's apt-get update, not upgrade
22:19 ^🔗	joepie91	just updates the package list
22:19 ^🔗	igelritte	I'm guessing that my ignorance is to blame
22:19 ^🔗	igelritte	right
22:19 ^🔗	nintendud	joepie91 / igelritte: here's a nice article on BASH traps, btw. http://phaq.phunsites.net/2010/11/22/trap-errors-exit-codes-and-line-numbers-within-a-bash-script/
22:19 ^🔗	igelritte	typo on my part
22:19 ^🔗	joepie91	had it break for some people because the package lists weren't up to date, so that's why update is there :)
22:21 ^🔗	nintendud	joepie91: also, why are you using useradd? On Debian, you're supposed to use the adduser command afaik
22:21 ^🔗	joepie91	adduser is interactive
22:21 ^🔗	nintendud	Doesn't have to be
22:21 ^🔗	nintendud	At least, I think you can make it a one-liner
22:21 ^🔗	joepie91	iirc I haven't found a way to make it not interactive
22:21 ^🔗	joepie91	:P
22:22 ^🔗	joepie91	anyway, any particular reason not to use useradd?
22:22 ^🔗	nintendud	Does useradd make the home directory?
22:22 ^🔗	joepie91	yes
22:22 ^🔗	nintendud	o
22:22 ^🔗	nintendud	Welp, adduser just follows a nice configuration file that specifies things like the permissions to set on the home directory among other things
22:23 ^🔗	nintendud	But I guess useradd works OK. I was just curious. :-)
22:36 ^🔗	DFJustin	SketchCow: there are more qaudio items, http://archive.org/details/archiveteam-qaudio-archive-1 through http://archive.org/details/archiveteam-qaudio-archive-7
22:39 ^🔗	DFJustin	also fan fiction http://archive.org/search.php?query=%22fan%20fiction%22%20archiveteam
22:41 ^🔗	joepie91	right
22:41 ^🔗	joepie91	pip install catarc
22:41 ^🔗	joepie91	:)
22:41 ^🔗	joepie91	cat for archives
22:48 ^🔗	SketchCow	OK, so I got out of a meeting about incorporating archive team stuff into wayback
22:48 ^🔗	SketchCow	NATURALLY it's slightly more complicated in some cases.
22:49 ^🔗	SketchCow	Let me make some changes to the thing.
22:52 ^🔗	chronomex	of course it is
22:52 ^🔗	chronomex	what kind of changes do they want?
22:55 ^🔗	SketchCow	Look at the document again. All green ones are cleared for takeoff.
22:55 ^🔗	chronomex	wow, awesome
22:57 ^🔗	chronomex	so looks like they can just suck in warc-in-nothing, yes?
22:59 ^🔗	SketchCow	Yes
22:59 ^🔗	SketchCow	They cannot suck in warc-in-archives
22:59 ^🔗	SketchCow	So, next step is to look at the archives ones and see if there's not too many WARCs in it, say less than 100
22:59 ^🔗	chronomex	I mean "just suck in" as in "point the ingestor at"
22:59 ^🔗	DFJustin	good thing we didn't upload 250tb of that XD
23:00 ^🔗	chronomex	lol yes
23:01 ^🔗	chronomex	mobileme: 280T of .tar containing .warc.gz
23:01 ^🔗	chronomex	soooo
23:02 ^🔗	SketchCow	We're aware of it and there'll be a project to deal with that.
23:02 ^🔗	SketchCow	But I don't want to rush it.
23:02 ^🔗	SketchCow	So Brewster's letting me make doubled files for weird ones.
23:06 ^🔗	DFJustin	even if there's a shitload of warcs inside they can all be cat-ed together into one megawarc right
23:07 ^🔗	arkhive	is there a webshots tracker I can check the progress? (I'm unable to help, I'm just curious how it's going)
23:07 ^🔗	DFJustin	http://tracker.archiveteam.org/webshots/
23:07 ^🔗	arkhive	thank you :)
23:08 ^🔗	DFJustin	underscor making his isp cry again
23:08 ^🔗	SketchCow	Yeah, but the machine is still down
23:08 ^🔗	SketchCow	so I don't know what's going on
23:08 ^🔗	SketchCow	DFJustin: Yes, exactly.
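The reason plain concatenation works is that a gzip reader treats back-to-back gzip files as one multi-member stream. A minimal sketch under that assumption, with two tiny stand-in files instead of real .warc.gz data and invented file names:

```python
import gzip
import os
import shutil
import tempfile


def cat_files(paths, dest):
    """Byte-for-byte concatenation, like `cat a b > dest`."""
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def demo():
    """Build two small gzip files, join them, and read the result back."""
    workdir = tempfile.mkdtemp()
    try:
        parts = []
        for name, payload in (("a.warc.gz", b"first\n"), ("b.warc.gz", b"second\n")):
            path = os.path.join(workdir, name)
            with gzip.open(path, "wb") as handle:
                handle.write(payload)
            parts.append(path)
        combined = os.path.join(workdir, "mega.warc.gz")
        cat_files(parts, combined)
        # gzip.open reads through every member in the combined file.
        with gzip.open(combined, "rb") as handle:
            return handle.read()
    finally:
        shutil.rmtree(workdir)


combined_content = demo()
```

This only shows that the joined file stays readable; pulling the .warc.gz files out of the surrounding .tar archives first is a separate step that the sketch does not cover.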
23:13 ^🔗	joepie91	alard: what about an 'assorted' warrior project
23:13 ^🔗	joepie91	with things that are small or heavily rate-limited (like some urlteam targets)
23:13 ^🔗	joepie91	that the warrior automatically switches to whenever it has nothing else to do
23:14 ^🔗	chronomex	that sounds cool.
23:14 ^🔗	joepie91	for example, if the current selected project is done
23:14 ^🔗	joepie91	a "let's not waste any time or bandwidth that we have" mode, so to say :P
23:14 ^🔗	chronomex	urlteam is a basically-no-bandwidth project, it might actually make more sense to run it in the background always.
23:15 ^🔗	joepie91	maybe have an 'always running' and 'assorted' project
23:15 ^🔗	chronomex	yeah
23:15 ^🔗	joepie91	separate projects... one always runs, like urlteam
23:15 ^🔗	joepie91	and assorted is filled with whatever small project is happening that doesn't warrant its own separate project, really
23:15 ^🔗	chronomex	'assorted' would be filler for "let archiveteam choose"
23:15 ^🔗	joepie91	as a fallback when it has nothing better to do
23:15 ^🔗	joepie91	well yes, but the thing is
23:16 ^🔗	joepie91	say that I've got it configured for btinternet
23:16 ^🔗	joepie91	the moment btinternet is done, which will be soon
23:16 ^🔗	joepie91	my warrior will be bored out of its skull, no?
23:17 ^🔗	chronomex	yes
23:18 ^🔗	joepie91	would be good if it switched to 'assorted' then :P
23:18 ^🔗	joepie91	'let archiveteam choose' has a pretty different function
23:18 ^🔗	joepie91	that option should always refer to the most urgent project
23:18 ^🔗	joepie91	such as, in this case, webshots
23:18 ^🔗	joepie91	assorted would have the stuff that isn't really urgent or significant, but has to be done anyway
23:18 ^🔗	joepie91	at some point in time
23:21 ^🔗	chronomex	ah
23:36 ^🔗	flaushy	hi, is fos.textfiles.com down?
23:36 ^🔗	joepie91	it is
23:37 ^🔗	flaushy	rsync will happily retry until it reappears, right?
23:37 ^🔗	joepie91	if I recall correctly, it will retry 50 times
23:37 ^🔗	joepie91	before giving up
23:37 ^🔗	joepie91	alard can probably confirm on that
23:37 ^🔗	flaushy	:( 50k link user in queue
23:37 ^🔗	SketchCow	Fortress of Solitude is Back
23:37 ^🔗	joepie91	ouch
23:37 ^🔗	joepie91	oh, it is?
23:38 ^🔗	joepie91	SketchCow: my warrior disagrees
23:38 ^🔗	joepie91	rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
23:38 ^🔗	flaushy	same here, but i guess it will work soon then :)
23:39 ^🔗	flaushy	probably we are just hammering it currently
23:39 ^🔗	flaushy	and thx for the info!
23:39 ^🔗	joepie91	aaaaand there it went
23:39 ^🔗	joepie91	:D
23:41 ^🔗	SketchCow	Hooray, 517 rsync connections.
23:41 ^🔗	joepie91	lol
23:41 ^🔗	flaushy	working for me now too :)
23:42 ^🔗	joepie91	:|
23:42 ^🔗	joepie91	uploads just died
23:42 ^🔗	joepie91	like, literally flatlined
23:42 ^🔗	joepie91	ah, it resumed
23:42 ^🔗	joepie91	and flatlined again
23:42 ^🔗	joepie91	wat
23:43 ^🔗	DFJustin	alard: you wanna run through the usernames in these https://en.wikipedia.org/wiki/Wikipedia:Bot_requests#btinternet
23:43 ^🔗	igelritte	so, from the following, I can assume that fos = fortress of solitude and that this is some place where folks are trying to rsync there current downloads to. Feel free to direct me to a link that will shut me up.
23:43 ^🔗	igelritte	*thier
23:43 ^🔗	igelritte	or maybe their
23:44 ^🔗	joepie91	igelritte: yes, fos is where the uploads go
23:44 ^🔗	igelritte	At some point grammar will come bck to me
23:44 ^🔗	chronomex	until then
23:44 ^🔗	igelritte	indeed
23:46 ^🔗	flaushy	phew... seems like some 1 gb stuff is in queue on nooon
23:50 ^🔗	joepie91	DFJustin: http://pastie.org/5032511
23:51 ^🔗	joepie91	is the clean version
23:51 ^🔗	joepie91	of all usernames for both .com and .co.uk
23:51 ^🔗	joepie91	sorted, unique
23:51 ^🔗	joepie91	also cc alard, idk if that list is already in the tracker
23:51 ^🔗	joepie91	k, time to sleep
23:51 ^🔗	joepie91	goodnight all :)
23:51 ^🔗	DFJustin	nice thanks
23:59 ^🔗	SketchCow	Well, FOS is getting CRUSHED, we'll see how long this lasts.
23:59 ^🔗	SketchCow	848 rsync connections
23:59 ^🔗	nintendud	lol