#archiveteam 2012-10-10,Wed

Time Nickname Message
00:09 🔗 SketchCow WE'LL FIND OUT
00:24 🔗 godane SketchCow: I'm grabbing all of the Official Xbox Magazine podcast
00:24 🔗 godane there is like 311 podcast
00:25 🔗 godane *podcasts
00:25 🔗 godane i'm uploading the rest of no bs podcast now too
02:49 🔗 dashcloud so, I've got all the laptop service manuals from dell's ftp- someone have a place I can upload them to?
11:37 🔗 joepie91 alard: is btinternet a warrior project yet?
11:38 🔗 alard Yes, it's more or less ready (barring any new insights) but it's not actually on the warrior.
11:39 🔗 alard https://github.com/ArchiveTeam/btinternet-grab
11:39 🔗 alard Is ready to go.
11:39 🔗 alard (Almost.)
11:40 🔗 alard Why?
11:42 🔗 joepie91 well, when it's done, my warrior has something important to do :P
11:43 🔗 alard We should keep looking for more usernames, though.
11:43 🔗 alard I added the sites from DMOZ, from the wayback machine and am waiting for the btinternet links on tvtropes.org.
11:45 🔗 joepie91 alright
11:49 🔗 alard I'm now downloading the wikipedia dump as well.
11:50 🔗 joepie91 wikipedia dump? as in, find btinternet links on wikipedia?
11:50 🔗 joepie91 speaking of which.. I'll have a look in the stackexchange dump
11:50 🔗 joepie91 I have it here locally
11:53 🔗 alard joepie91: Yes, bunzip2 | grep ...
11:54 🔗 alard It seems that there are a few links on Wikipedia: https://encrypted.google.com/search?hl=en&q=site%3Awikipedia.org%20btinternet.co.uk
11:55 🔗 joepie91 oh goddamnit, I removed the stackexchange data dump a few days ago
11:55 🔗 joepie91 redownload time
11:59 🔗 SmileyG alard: I think "all Projects" tab in warrior should be "Choose Project" ?
12:00 🔗 alard SmileyG: Perhaps. But "Choose" is a verb. "Settings" is not. Is "Available projects" a solution?
12:01 🔗 SmileyG Yeah that works
12:01 🔗 SmileyG Currently I'd think "All Projects" would select all projects...... make sense?
12:03 🔗 alard Yes, I think I understand your point. (Although you could also say that it's a tab, not a button, so it shows you "all projects", like it does.)
12:03 🔗 SmileyG Hehe
12:03 🔗 joepie91 UI design is hard :P
12:04 🔗 SmileyG Well I have a habit of reading things differently to others, but I was good at it at uni. :S
12:04 🔗 alard It's fun.
12:15 🔗 alard http://tracker.archiveteam.org/btinternet/
12:16 🔗 alard (Don't go too fast.)
12:16 🔗 BlueMaxim thanks for reminding me to see how webshots was doing :P
12:16 🔗 BlueMaxim underscor with 2364GB.
12:16 🔗 BlueMaxim I'm going to kill him one day
12:19 🔗 balrog_ why does it say only 8 items done so far? :P
12:19 🔗 balrog_ oh I see...
12:19 🔗 balrog_ nvm :P
12:21 🔗 alard balrog_: You could be number 1 with 9!
12:22 🔗 balrog_ alard: do I have to use warrior?
12:22 🔗 balrog_ :|
12:23 🔗 alard What's wrong with the warrior? It's a small project.
12:23 🔗 balrog_ takes up more ram and cpu on my side :/
12:23 🔗 BlueMaxim It's pretty minimal how much it takes up.
12:24 🔗 joepie91 BlueMaxim: not exactly
12:24 🔗 joepie91 it uses up to 20% of my 4GB of RAM
12:24 🔗 SmileyG how long til bt ties?
12:24 🔗 joepie91 usually around 13
12:24 🔗 BlueMax joepie91, seriously? I thought it only needed 256MB of RAM
12:24 🔗 joepie91 BlueMax: that's the VM itself - apparently virtualbox adds a bunch of overhead on top of that
12:24 🔗 joepie91 also
12:24 🔗 joepie91 or something
12:24 🔗 joepie91 it's quite heavy on CPU
12:24 🔗 joepie91 on my shitty notebook i3
12:25 🔗 joepie91 2 x 1,3ghz
12:25 🔗 Cameron_D hm, these are synging like 6kb
12:25 🔗 Cameron_D *syncing
12:25 🔗 Cameron_D I guess if there is only one page
12:26 🔗 Cameron_D Oh, 404 error, even smaller
12:26 🔗 BlueMax guess I didn't notice
12:27 🔗 BlueMax My computer must be better at this than I thought :P
12:28 🔗 SmileyG can the tracker do less more verbose than 0Mb?
12:30 🔗 Deewiant I ran into virtualbox using 7 gigabytes of ram before it got OOM-killed
12:30 🔗 Deewiant While running the warrior a few days back
12:31 🔗 joepie91 lot of 0MBs
12:31 🔗 joepie91 lol
12:31 🔗 BlueMax memory leak to the max :P
12:32 🔗 SmileyG alard: when should it start new processes :S, I've got it set to 6 but it still only shows 4?
12:33 🔗 alard SmileyG: When an item finishes the warrior checks the number to see how many new items there should be.
12:33 🔗 SmileyG hmmm k
12:33 🔗 SmileyG one's just finished, let's see if it works this time
12:33 🔗 SmileyG Also I've changed to BT but the banner still shows webshots (I presume because some of the jobs are still webshots).
12:34 🔗 joepie91 have there ever been archiving/warrior projects where the warriors were throttled/rate-limited/blocked?
12:34 🔗 joepie91 SmileyG: it will first finish the webshots jobs
12:34 🔗 joepie91 then move on to BT
12:34 🔗 SmileyG I rate limit mine joepie91 :P
12:34 🔗 joepie91 oooo, 39MB
12:34 🔗 alard The warrior can't run multiple projects at the same time, so yes, it waits for webshots to complete.
12:35 🔗 SmileyG ok, makes sense :D
12:35 🔗 alard (Also: why not keep it on webshots? I expect btinternet won't take long.)
12:35 🔗 BlueMax it'd be cool if it could multitask
12:35 🔗 BlueMax one process on one project, four on another
12:35 🔗 SmileyG I have a webshots running at work on 5Mbit, this is amazingly slow compared to that ;)
12:42 🔗 joepie91 alard: http://www.quickonlinetips.com/archives/2012/09/google-feedburner-shutting-down/
12:43 🔗 joepie91 not sure if there's any useful data on feedburner
12:43 🔗 joepie91 but sure looks like signs of imminent death
12:43 🔗 joepie91 also http://searchenginewatch.com/article/2213759/Google-Shutting-Down-AdSense-for-Feeds-Classic-Plus-More-Services?utm_source=twitterfeed&utm_medium=twitter
12:43 🔗 alard Isn't that just a proxy/cache/stats service?
12:43 🔗 Cameron_D Yeah, it is a stats tracking service for RSS feeds
12:44 🔗 Cameron_D So thousands of RSS feeds will break
12:44 🔗 Cameron_D but they don't really host much data
12:44 🔗 joepie91 this may also be a problem for THQ-related sites: http://www.gamearena.com.au/news/read.php/5116588
12:44 🔗 joepie91 THQ Asia Pacific shutting down
12:44 🔗 godane i got to grab my t3 magazine podcast then
12:45 🔗 joepie91 are there any THQ Asia Pacific-run sites that have user content?
12:45 🔗 Cameron_D looking now
12:46 🔗 godane Cameron_D: links to a lot of podcasts and stuff could be lost
12:47 🔗 godane http://feeds.feedburner.com/T3/podcast
12:48 🔗 Cameron_D feedburner just acts as a proxy though (To collect stats)
12:48 🔗 Cameron_D Somewhere on the t3 site is the actual feed
12:48 🔗 Cameron_D At least that is how I remember it working
12:49 🔗 godane but that feed i think doesn't go back that far
12:50 🔗 joepie91 Cameron_D: also as an aggregator afaik
12:50 🔗 godane their only feed is from feedburner
13:07 🔗 balrog_ the warrior image has issues
13:08 🔗 balrog_ first off, vmware complains that it doesn't meet ova specs
13:08 🔗 balrog_ second, I get an error that there's an ide slave with no master
13:08 🔗 alard balrog_: Which image?
13:09 🔗 alard 20121008?
13:09 🔗 balrog_ archiveteam-warrior-v2-20121008
13:09 🔗 balrog_ yes
13:09 🔗 Cameron_D http://dmorton.staff.hostgator.com/archiveteam-warrior-vmware.ova vmware-compatible (albeit an older version)
13:09 🔗 balrog_ why did this one break?
13:10 🔗 alard I don't know about the ova specs. There previously was a problem with the filename. I had exported the image as archiveteam-warrior-v2.ova, and then renamed it to include the date. This new image is exported with the correct name.
13:10 🔗 alard And IDE slave with no master, that seems to be a virtualbox - vmware incompatibility.
13:10 🔗 balrog_ The import failed because /path/to/archiveteam-warrior-v2-20121008.ova did not pass OVF specification conformance or virtual hardware compliance checks. Click Retry to relax OVF specification and virtual hardware compliance checks and try the import again, or click Cancel to cancel the import. If you retry the import, you might not be able to use the virtual machine in VMware Fusion.
13:11 🔗 alard I've added two disks in VirtualBox, but for some reason VMware ends up with two controllers: 1-master for disk 1, 2-slave for disk 2.
13:11 🔗 balrog_ and then ... There is an IDE slave with no master at ide1:1. This configuration does not work correctly in virtual machines. Move the disk/CD-ROM from ide1:1 to ide1:0 using the configuration editor.
13:12 🔗 balrog_ I wouldn't be surprised if VBox is malforming the ova
13:12 🔗 balrog_ VBox is unfortunately full of bugs
13:13 🔗 Cameron_D heh, ESXi still rejects the file too http://i.imgur.com/z3Kox.png
13:14 🔗 balrog_ hm, they have an OVF tool
13:16 🔗 S[h]O[r]T balrog_
13:16 🔗 S[h]O[r]T are you running vmware workstation?
13:16 🔗 balrog_ no, fusion
13:16 🔗 balrog_ which is basically the mac version of workstation
13:17 🔗 S[h]O[r]T when i first imported archiveteam-warrior-v2-20120813 i got the error about it not being valid. then i just imported again and it worked.
13:17 🔗 S[h]O[r]T i got the ide error as well after that too
13:17 🔗 balrog_ yeah but I keep getting the ide error
13:17 🔗 S[h]O[r]T you just have to go into the settings and change the second drive to ide0:1
13:17 🔗 S[h]O[r]T from ide 1:0
13:23 🔗 balrog_ hmm
13:23 🔗 balrog_ what if someone imported the vm into vmware, fixed it, and exported it?
13:23 🔗 balrog_ I wonder if the ova file would be more up-to-spec
13:25 🔗 S[h]O[r]T you'd probably want to export as a vmdk or whatever the vmware equivalent is. you can always just rar up the vmdk files and if someone uses them vmware will just ask if they copied it
13:25 🔗 joepie91 alard: btinternet\.(com|co\.uk)
13:25 🔗 joepie91 right?
13:25 🔗 balrog_ ova is better if it's compatible
13:25 🔗 balrog_ err, compliant
13:25 🔗 balrog_ apparently vbox doesn't produce compliant files
13:26 🔗 joepie91 bingo
13:26 🔗 joepie91 http://www.btinternet.com/~se16/hgb/statjoke.htm
13:26 🔗 joepie91 se16 :P
13:27 🔗 godane uploaded: http://archive.org/details/cdrom-linuxformatmagazine-76
13:27 🔗 alard joepie91: Yes, and then www\.(.+)\.btinternet or /~([^%?/]+)
13:28 🔗 SmileyG Final webshots rsync finishes in a few min and then bt ':D
13:29 🔗 joepie91 alard: I've also seen a few *without* www in front
13:29 🔗 joepie91 and just the username
13:31 🔗 joepie91 alard: 7z e -so *.7z | grep -P "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
13:31 🔗 joepie91 :)
13:31 🔗 joepie91 will take a few hours for the torrent to finish downloading
13:31 🔗 joepie91 after that, that will yield all the relevant entries
13:36 🔗 joepie91 better:
13:36 🔗 joepie91 7z e -so *.7z 2> /dev/null | grep -Po "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
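The two patterns discussed above (alard's `www.NAME.btinternet` / `/~NAME` forms and joepie91's combined grep) can be sketched in Python as follows. This is a minimal sketch, not the code anyone in the channel actually ran: the function name `find_usernames` and both compiled patterns are assumptions made for illustration, and the character classes are simplified from the grep expression.

```python
import re

# Two URL shapes mentioned in the channel (patterns are illustrative):
#   http://www.btinternet.com/~username/...
#   http://username.btinternet.co.uk/   (with or without a leading "www.")
TILDE_RE = re.compile(r"btinternet\.(?:com|co\.uk)/~([^/\s%?]+)", re.I)
HOST_RE = re.compile(r"(?:www\.)?([a-z0-9_-]+)\.btinternet\.(?:com|co\.uk)", re.I)


def find_usernames(text):
    """Return the set of candidate btinternet usernames found in text."""
    names = set()
    for match in TILDE_RE.finditer(text):
        names.add(match.group(1).lower())
    for match in HOST_RE.finditer(text):
        name = match.group(1).lower()
        # "www.btinternet.com" itself is the main site, not a user.
        if name != "www":
            names.add(name)
    return names
```

Feeding it a decompressed dump line by line would mirror the `7z e -so ... | grep -Po ...` pipeline, with the set taking the place of a later `sort -u`.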
13:57 🔗 balrog_ how well does warrior handle a network connection change?
14:01 🔗 balrog_ how well does warrior handle a network connection change?
14:01 🔗 balrog_ also, why no rsync with continue?
14:05 🔗 SmileyG balrog_: it should back off then continue once it figures it out
14:06 🔗 balrog_ you mean with the wget?
14:06 🔗 balrog_ rsync seems to lack continue though...
14:08 🔗 alard Doesn't --partial-dir enable --partial?
14:08 🔗 alard (Just rsync --partial is dangerous in this case, since SketchCow will move any file in the upload directory.)
14:22 🔗 willwill Hey there, if you see my name on uncompleted webshots job please release the lock.
14:25 🔗 alard willwill: No problem. (There will probably be other failed jobs, so I'll requeue them all at once later.)
14:46 🔗 SmileyG balrog_: rsync, continue?
14:46 🔗 SmileyG rsync knows what its sent and it doesn't require continue
14:46 🔗 balrog_ resume rather
14:47 🔗 balrog_ --partial or -P switch
14:47 🔗 SmileyG doesn't need it....
14:47 🔗 SmileyG partial does partial files
14:48 🔗 SmileyG rsync checks for each file as it goes
14:48 🔗 balrog_ yeah well a single .warc is pretty large
14:48 🔗 balrog_ and if it gets interrupted, whole thing has to start over
14:48 🔗 SmileyG yeah true, then you're screwed :S
14:52 🔗 alard I've added --partial to btinternet, so the next project will have it too.
14:52 🔗 SmileyG Isn't that going to cause issues as you highlighted earlier?
14:52 🔗 alard No, because --partial-dir keeps the partial files in a separate directory.
14:53 🔗 alard They're uploaded to the .rsync-tmp/ subdirectory and moved when they're uploaded.
14:54 🔗 alard I thought --partial-dir would be enough, but apparently you need --partial too.
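The combination alard settles on here (both `--partial` and `--partial-dir`, with partial files kept in `.rsync-tmp/`) might look like the following when an rsync command line is built from Python. This is only a sketch: the helper name `rsync_upload_args`, the `-av` flags and the example target are assumptions, not taken from the btinternet pipeline.

```python
def rsync_upload_args(local_path, target):
    """Build an rsync argument list that can resume interrupted uploads.

    Partial files go to a separate directory on the receiving side, so a
    half-uploaded file is not sitting in the main upload directory.
    """
    return [
        "rsync",
        "-av",
        "--partial",                 # keep partially transferred files
        "--partial-dir=.rsync-tmp",  # ...but in a separate subdirectory
        local_path,
        target,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(rsync_upload_args("data/item.warc.gz",
                                     "rsync://example.invalid/module/")))
```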
14:55 🔗 SmileyG oooo
14:55 🔗 SmileyG heh thats random devs for you
14:59 🔗 joepie91 alard: the title in the btinternet pipeline.py is still webshots
14:59 🔗 joepie91 ;)
15:02 🔗 alard I see. And apparently the title isn't used anywhere.
15:03 🔗 alard Wikipedia produced 933 new btinternet names.
15:04 🔗 joepie91 :D
15:04 🔗 joepie91 I'm searching math stackexchange now
15:04 🔗 SmileyG wikipedia? :o
15:04 🔗 joepie91 alard: stats stackexchange produced "se16" as only username
15:06 🔗 joepie91 it's referenced a *lot* on math. as well
15:06 🔗 joepie91 seems like a pretty important site
15:06 🔗 joepie91 ha
15:06 🔗 joepie91 Think twice before using BT as an ISP.
15:06 🔗 joepie91 on the homepage of that site
15:06 🔗 joepie91 BT used to provide its internet subscribers with a small amount of personal webspace, but did not promote the service so only the oldest most loyal customers used it. Now it now longer wishes to satisfy these customers and is closing the service down. So this page and others of mine, which have received over 2 million hits in 13 years, have to move.
15:06 🔗 joepie91 If your browser does not automatically go to http://www.se16.info/index.htm within a few seconds, you may want to go to the destination manually.
15:06 🔗 joepie91 My conclusion is that if you ever consider BT as a possible ISP for some reason, you should not expect that reason to last.
15:07 🔗 SmileyG yah
15:09 🔗 alard joepie91: We already had it. :) Processed items: 1, added to main queue: 0
15:12 🔗 joepie91 alright :P
15:12 🔗 joepie91 brb
15:14 🔗 DoubleJ alard: Quick question about the warrior: If there are multiple warcs waiting to upload, how does it decide which one goes next?
15:15 🔗 alard LIFO, I think, but if you really want to know you should check here: https://github.com/ArchiveTeam/seesaw-kit/blob/master/seesaw/task.py#L72-107
15:17 🔗 DoubleJ I... have no idea what I'm looking at.
15:18 🔗 DoubleJ But since it looks like array manipulation, I'm guessing my request to do smallest file first is a no-go.
15:19 🔗 alard That would be hard, I think. Then the queueing thing would have to know about file sizes.
15:19 🔗 alard And does it really matter?
15:19 🔗 DoubleJ Kinda-maybe. It'd free up more threads to download quicker.
15:20 🔗 DoubleJ As it is there are times when all my worker threads are waiting for one upload to finish so they can go.
15:20 🔗 DoubleJ Of course then you'd have a problem with large files never uploading, but you could conceivably have that with LIFO as well and I haven't seen it happen yet.
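DoubleJ's "smallest file first" idea can be illustrated with a small priority queue. This is a sketch of the idea only, not how seesaw's queueing works; the class name `SmallestFirstQueue` and its methods are hypothetical.

```python
import heapq
import itertools


class SmallestFirstQueue:
    """Upload queue that always hands out the smallest waiting file."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so equal sizes come out in insertion order.
        self._counter = itertools.count()

    def put(self, name, size_bytes):
        heapq.heappush(self._heap, (size_bytes, next(self._counter), name))

    def get(self):
        _size, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)
```

As noted in the discussion, a policy like this can starve large files if small ones keep arriving; a real implementation would probably need some form of aging to prevent that.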
15:22 🔗 alard Maybe the upload limit should just go.
15:23 🔗 alard Some people wanted it in the previous warrior.
15:23 🔗 SmileyG I limit the VM, shrug.
15:23 🔗 DoubleJ Upload limit, as in throughput, or as in waiting turns?
15:24 🔗 alard Waiting turns. I think the thinking then was that one rsync uploads faster, so can start downloading sooner.
15:24 🔗 alard The opposite of what you say now, basically. :)
15:24 🔗 DoubleJ I can kinda see that, since the overhead for switching wouldn't help overall.
15:24 🔗 SmileyG wasn't it because the upload location was really slow at one point?
15:24 🔗 SmileyG and no one could finish anything :D
15:24 🔗 SmileyG ended up eating all the space on the warriors.
15:25 🔗 DoubleJ Is there someplace I can set it to let 2 upload at once, see if there are any wins to be had that way?
15:26 🔗 SmileyG yup
15:26 🔗 SmileyG you running vm?
15:26 🔗 SmileyG I have upto 6 uploads at once.
15:26 🔗 DoubleJ Yes.
15:26 🔗 SmileyG ok, on the vm window
15:26 🔗 SmileyG alt+F3
15:26 🔗 DoubleJ OK, log in to the VM. Got that.
15:26 🔗 SmileyG nano -w /home/warrior/projects/webshots/pipeline.py
15:27 🔗 SmileyG ctrl+w
15:27 🔗 DoubleJ (Well, I will have that, about 6:00 tonight. can't access the VM from work :) )
15:27 🔗 SmileyG Ah ok
15:27 🔗 SmileyG I need to do a page on this on the wiki
15:27 🔗 DoubleJ But keep going. I'll check the scrollback tonight.
15:29 🔗 DoubleJ alard: Dunno what project it was requested for, but webshots may just be a different critter. Large variation in upload sizes. Waiting is probably still good, we just might want to be smarter about the criteria for deciding who's next :)
15:29 🔗 DoubleJ But the current warrior wins on simplicity.
15:29 🔗 alard Is it worth removing the limit?
15:29 🔗 SmileyG type LimitConcurrent and hit enter, and change the 1 to 6 (or whatever figure)
15:29 🔗 DoubleJ (At least, I think it does. I can read Python about as well as I can read Japanese. (Not at all.))
15:30 🔗 DoubleJ I'll try mine tonight. It may let smaller files squeak out, but it may also take longer because of drive-spinning at either end.
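The effect of changing the limit from 1 to 6 can be illustrated with a plain semaphore. This is a generic sketch of "at most N uploads at once" and does not reproduce the warrior's own `LimitConcurrent` code; `run_limited` and its return value are assumptions made for the example.

```python
import threading
import time


def run_limited(tasks, limit):
    """Run every task in its own thread, at most `limit` at a time.

    Returns the highest number of tasks observed running simultaneously.
    """
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def worker(task):
        with slots:  # wait for a free upload slot
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                task()
            finally:
                with lock:
                    state["active"] -= 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state["peak"]


if __name__ == "__main__":
    fake_upload = lambda: time.sleep(0.05)
    print("peak concurrency:", run_limited([fake_upload] * 6, limit=2))
```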
15:32 🔗 alard Word of caution: if you change the pipeline.py in your warrior, you may break future updates. (If git can't figure out how to apply the update to your modified version.)
15:32 🔗 SmileyG heh, i seem to have broken it anyway ¬_¬
15:32 🔗 SmileyG still getting no output
15:33 🔗 alard Stop the project, go into your warrior and use git pull to figure out what's wrong?
15:33 🔗 DoubleJ Understood. But define "break". Update won't apply, warrior will conk out, house burns down, what?
15:33 🔗 alard I think you can expect the SmileyG problem.
15:34 🔗 DoubleJ Ah.
15:34 🔗 SmileyG webserver runs, nothing else does :D
15:34 🔗 alard So you'll have to login, use git pull to figure out what's going wrong.
15:34 🔗 DoubleJ And as we're talking about it my 261-meg user finishes:)
15:35 🔗 primus alard, would it work to just delete project and restart warrior?
15:35 🔗 SmileyG alard: I'd vote for keep the limit, but add option to change it.
15:35 🔗 alard SmileyG: Is that worth stopping every warrior? (That's what happens if I push an update. Every warrior will finish its current task and restart the project.)
15:36 🔗 alard primus: That would work.
15:36 🔗 SmileyG alard: can't you just do the update and let them pull it in time?
15:36 🔗 DoubleJ Yeah, restarting warriors on this project I think is worse.
15:36 🔗 alard Define "in time"?
15:36 🔗 SmileyG when ever they restart their vm?
15:36 🔗 alard No. They check for updates on github.
15:36 🔗 SmileyG Also, add "Check for updates" button to settings page?
15:36 🔗 DoubleJ Heh. Like Windows Update. "Updates to this warrior are now available. Apply? This may require your warrior to restart."
15:36 🔗 primus lol
15:37 🔗 SmileyG where do I run the git pull?
15:37 🔗 alard What we should have, in a future version, is a gradual update.
15:37 🔗 alard cd /home/warrior/projects/$project/
15:37 🔗 alard (perhaps su warrior first)
15:38 🔗 SmileyG hmmm its moanin about the changes in pipeline
15:39 🔗 * SmileyG changes it back and git pulls
15:39 🔗 DoubleJ It'd probably be an awful bitch, but would the multiple-project stuff be useful for that? So /home/warrior/projects/$project.$version instead? Let one run out while the new one sees threads disappear and spins up?
15:39 🔗 DoubleJ s/stuff/idea/
15:40 🔗 SmileyG alard: ok I see the new rsync code...
15:40 🔗 SmileyG need to restart the warrior for web interface to update?
15:41 🔗 SmileyG or is it only set via the code (And won't this then cause git to explode again?)
15:41 🔗 SmileyG :O
15:41 🔗 SmileyG ITS GONE CRAZY
15:41 🔗 SmileyG 15 users and counting on one screen
15:43 🔗 SmileyG There we go...
15:43 🔗 SmileyG that is bonkers when it first starts up
15:43 🔗 SmileyG you just see hundreds of boxes popping up
15:44 🔗 SmileyG alard: I remember - The script to create the 50GB tars couldn't keep up for FortuneCity, that's why the rsync got limited.
15:54 🔗 alard DoubleJ: Yes, that's similar. (I was thinking it might be better to have the cloned git repo in /home/warrior/projects/$project, as the most up-to-date version, then do a clone to /data/projects/$project.$version before starting a project.)
16:37 🔗 alard Have we killed fos?
16:38 🔗 SmileyG :O
16:39 🔗 SmileyG 2Kb/s! \o/
16:39 🔗 SmileyG Oh its coming back now
16:40 🔗 SmileyG Planned Delivery Date
16:40 🔗 SmileyG Wednesday 10th October
16:40 🔗 SmileyG Planned Delivery Time
16:40 🔗 SmileyG Between 07:30 and 17:30
16:40 🔗 SmileyG Wed Oct 10 17:40:33 BST 2012
16:40 🔗 SmileyG HERP?
17:08 🔗 joepie91 HEY
17:08 🔗 SmileyG yeah the uploads are totally dead?
17:08 🔗 joepie91 primus
17:08 🔗 joepie91 :(
17:08 🔗 joepie91 you've overtaken me
17:08 🔗 joepie91 SmileyG: ?
17:08 🔗 SmileyG 4587520 39% 12.21kB/s 0:09:45
17:08 🔗 SmileyG [sender] io timeout after 300 seconds -- exiting
17:09 🔗 joepie91 sec
17:09 🔗 joepie91 wtf, mine is dead
17:09 🔗 SmileyG Retrying RsyncUpload for Item jpr.tree after 30 seconds...
17:13 🔗 SmileyG .... brokeyd :D
17:13 🔗 SmileyG alard: did you break something :(
17:21 🔗 joepie91 my rsyncs are dying..
17:21 🔗 joepie91 rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
17:21 🔗 joepie91 Process RsyncUpload returned exit code 10 for Item andrewjjstanley
17:21 🔗 joepie91 Retrying RsyncUpload for Item andrewjjstanley after 30 seconds...
17:21 🔗 joepie91 rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
17:22 🔗 SmileyG yah
17:22 🔗 SmileyG :<
17:23 🔗 SmileyG they retry, but still its killed all progress :<
17:23 🔗 joepie91 oh
17:23 🔗 joepie91 they run now
17:24 🔗 alard http://isup.me/fos.textfiles.com
17:26 🔗 alard I think this is a SketchCow problem.
17:27 🔗 SmileyG :<
17:27 🔗 alard (The warriors will retry 50 times with 30 second pauses before they fail.)
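The retry behaviour alard describes (up to 50 attempts with a 30 second pause between them) follows a common pattern, sketched below. This is not the warrior's actual code; the `retry` helper and its parameters are assumptions chosen to match the numbers quoted above.

```python
import time


def retry(action, attempts=50, pause_seconds=30, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    `sleep` is injectable so the pause can be stubbed out in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # a sketch: real code would be narrower
            last_error = error
            if attempt < attempts:
                sleep(pause_seconds)
    raise last_error
```

With the defaults, an upload target that stays unreachable is retried for a little under 25 minutes of pauses before the item fails.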
17:28 🔗 SmileyG :< herp.
17:34 🔗 joepie91 alard: it responds to ping
17:46 🔗 SmileyG alard: se16 0MB << hey look :D
18:21 🔗 joepie91 SmileyG: mmm
18:21 🔗 joepie91 it's probably because he replaced the index page
18:22 🔗 SmileyG joepie91: yeah I figured it might be that.
18:22 🔗 SmileyG well it makes sense, the script forwards you off site.
18:41 🔗 underscor fos is currently down-ish
18:41 🔗 underscor fyi
18:41 🔗 chronomex ish
18:41 🔗 chronomex how can a box be down-ish
18:42 🔗 SketchCow He's mincing words.
18:42 🔗 underscor it still pings
18:42 🔗 SketchCow It's down.
18:42 🔗 SketchCow It's superdown.
18:42 🔗 underscor VMs at archive have 3 states. Up, nossh/services, and noping
18:43 🔗 underscor anyway, yeah, it's turbofucked
18:46 🔗 Nemo_bis how does tpb fetch Google Books' stuff? does it accept suggestions? http://lists.wikimedia.org/pipermail/wikisource-l/2012-October/001204.html
18:49 🔗 underscor wait
18:49 🔗 underscor how is rsync still working if fos is down :O
19:13 🔗 SketchCow OKAY HI
19:13 🔗 SketchCow NEED HELP
19:14 🔗 SketchCow https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
19:14 🔗 SketchCow OK, that's a listing of all archiveteam projects on archive.org.
19:14 🔗 SketchCow 1. Please see if I missed any.
19:15 🔗 SketchCow (i.e. just browse through the archiveteam set to see)
19:28 🔗 underscor haha, I love the item counts
19:28 🔗 underscor 26, 70, 29, 3956
19:35 🔗 chronomex is IA down? not working for me.
19:39 🔗 godane its not working for me too
19:39 🔗 chronomex k
19:42 🔗 SmileyG SketchCow: you missed the most famous of all - geocities.
19:45 🔗 joepie91 heh
19:45 🔗 joepie91 okay, maybe a recursive grep through my entire repository folder was a bad idea
19:46 🔗 alard Geocities isn't warc.
19:46 🔗 underscor IA is fucked right now
19:46 🔗 underscor please leave a message after the beep
19:46 🔗 underscor :D
19:46 🔗 * chronomex waits for the beep
19:46 🔗 underscor boop
19:46 🔗 * SmileyG hears helicopters
19:47 🔗 underscor But yeah, it's down. One of the core boxes decided to take a dump all over everything, people are working on fixing now
19:47 🔗 chronomex ok, I'm not in a hurry
19:47 🔗 joepie91 underscor: wat
19:47 🔗 joepie91 IA went down?
19:48 🔗 underscor it's down right now
19:48 🔗 SmileyG we broke it ¬_¬
19:48 🔗 underscor lol
19:49 🔗 joepie91 oh wow
19:50 🔗 alard Can't edit the list, but Cinch is missing. City of Heroes (two items, I think: boards and www).
19:52 🔗 alard Qaudio.
20:04 🔗 joepie91 god I hate efnet
20:05 🔗 joepie91 anyway
20:05 🔗 joepie91 is anyone up for testing a useful script?
20:05 🔗 joepie91 wrote a script that takes a glob pattern, then tries to figure out (from extension) what kind of archive each file is, and prints the decompressed contents to stdout using the appropriate application, without actually unpacking it
20:05 🔗 joepie91 consider it a 'cat' for archives :)
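joepie91's script is not shown in the log, so the following is only a guess at its shape: pick a decompression command from the file extension and let that command write to stdout. The extension table, `command_for` and `cat_archives` are all assumptions; the command-line tools themselves (`tar`, `gzip`, `bzip2`, `xz`, `unzip`, `7z`) must be installed for it to do anything.

```python
import glob
import subprocess
import sys

# Longer suffixes come first so ".tar.gz" wins over ".gz".
COMMANDS = [
    (".tar.gz", ["tar", "-xzOf"]),
    (".tgz", ["tar", "-xzOf"]),
    (".tar.bz2", ["tar", "-xjOf"]),
    (".tar", ["tar", "-xOf"]),
    (".gz", ["gzip", "-dc"]),
    (".bz2", ["bzip2", "-dc"]),
    (".xz", ["xz", "-dc"]),
    (".zip", ["unzip", "-p"]),
    (".7z", ["7z", "e", "-so"]),
]


def command_for(path):
    """Return the argv that prints the archive's contents, or None."""
    lower = path.lower()
    for suffix, argv in COMMANDS:
        if lower.endswith(suffix):
            return argv + [path]
    return None


def cat_archives(pattern):
    """Print the decompressed contents of every archive matching pattern."""
    for path in sorted(glob.glob(pattern)):
        argv = command_for(path)
        if argv is None:
            print("skipping unknown archive type: " + path, file=sys.stderr)
            continue
        subprocess.run(argv, check=False)  # output goes to our stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    cat_archives(sys.argv[1])
```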
20:14 🔗 SmileyG so like zcat?
20:15 🔗 chronomex igelritte: you know you can be in multiple channels at once, right?
20:15 🔗 underscor igelritte: yeah, most of us are in both
20:16 🔗 chronomex well, actually, I don't know how to do it with pidgin
20:16 🔗 chronomex but I think you can
20:16 🔗 underscor just /j #channel1 and /j #channel2
20:16 🔗 underscor they open up as tabs
20:16 🔗 underscor at least in my pidgin
20:17 🔗 igelritte yeah, I didn't think about it
20:17 🔗 igelritte whateve's. I'm here now
20:18 🔗 chronomex k
20:19 🔗 igelritte so, tell me more about your structure and how one can plug in.
20:21 🔗 igelritte Is it some starry-eyed-open-source-free-for-all? Or is there a process wherein you tell a gatekeeper what you can do, what you're experienced with, and then they tell you where you can start helping?
20:22 🔗 chronomex freeforall.
20:23 🔗 igelritte I've seen Mr. Scott's presentation at Defcon on how AT is going to save your shit...which sounds good to me...but that doesn't tell me a lot about how the group is organized.
20:23 🔗 SmileyG some people write code
20:23 🔗 SmileyG I appear and make comments
20:23 🔗 SmileyG most people run some sort of downloaders
20:23 🔗 SmileyG godane is ..... well I don't know :D
20:24 🔗 mistym There are often projects you can help in by running code written by others, basically volunteering your bandwidth to help out.
20:24 🔗 chronomex godane is affiliated but mostly works on solo projects
20:24 🔗 mistym Those are usually advertised on the wiki and IRC, plus I think there's a mailing list for it now too.
20:24 🔗 igelritte Unfortunately, I'm not really in a good position at the moment to run downloaders or anything else that requires a 24 hour network connection.
20:24 🔗 SmileyG If you haven't got bandwidth, then you can help with the wiki and possibly coding...
20:25 🔗 SmileyG doesn't need 24hr, it'll work when you can
20:25 🔗 SmileyG upto a point
20:25 🔗 DFJustin joepie91: that already exists as lsar in The Unarchiver, although it's all built-in and not invoking other apps
20:25 🔗 igelritte I'm following this silly dream about living in Germany which means that my current address is--shall we say--fluid.
20:25 🔗 DFJustin oh wait I'm wrong nm
20:26 🔗 DFJustin keep forgetting unix cat is not the same as apple II cat :)
20:26 🔗 igelritte Are most people in North America?
20:26 🔗 chronomex a good number but by no means all
20:26 🔗 SmileyG i'm UK
20:27 🔗 igelritte I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
20:27 🔗 DFJustin igelritte: jason is in the gatekeeper role more or less, or cat herder if you prefer
20:27 🔗 chronomex in order probably US, UK, AU, .eu
20:27 🔗 igelritte Jason seems to do a lot.
20:29 🔗 DFJustin but there's a lot of empowerment if you see something to just do it yourself
20:29 🔗 igelritte Well, I can definitely help with the wiki
20:30 🔗 igelritte when you say, 'coding', what do you mean?
20:30 🔗 soultcer Programming stuff that downloads stuff
20:30 🔗 igelritte I have a fair amount of experience with BASH scripting
20:31 🔗 igelritte what are you guys using to download stuff?
20:31 🔗 DFJustin perfect
20:31 🔗 DFJustin primarily wget
20:31 🔗 igelritte oh, hold on there, soldier, my BASH scripting is far from perfect
20:31 🔗 joepie91 DFJustin: The Unarchiver sounds like a comic hero :P
20:31 🔗 DFJustin it's like a real life superhero
20:31 🔗 igelritte but I have written some stuff using wget to batch download stuff for myself
20:32 🔗 DFJustin the main difference is we use a parameter to wget to have it produce .warc files which are a full record of HTTP headers etc. suitable for going into the wayback machine
20:32 🔗 igelritte lectures from the OpenCourseWare project at MIT
20:32 🔗 igelritte hmmm
20:33 🔗 alard Yes, so if you download anything for archiving, use the --warc-file option (available in Wget 1.14).
20:34 🔗 igelritte hmmm. It appears that the wget that comes with Ubuntu these days is 1.13
20:34 🔗 igelritte at least, so says dpkg
20:35 🔗 mistym You'll need to build it yourself then (or grab a newer package). .warc support wasn't added until 1.14.
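The version requirement above (WARC output needs Wget 1.14 or later) can be checked before starting a grab. The sketch below parses the first line of `wget --version` and builds an argument list using `--warc-file`; the helper names and the extra `--mirror` and `--page-requisites` flags are illustrative assumptions, not a recommended Archive Team invocation.

```python
import re


def supports_warc(version_output):
    """True if the `wget --version` text reports Wget 1.14 or newer."""
    match = re.search(r"GNU Wget (\d+)\.(\d+)", version_output)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (1, 14)


def warc_wget_args(url, warc_name):
    """Build a wget command line that also writes <warc_name>.warc.gz."""
    return [
        "wget",
        "--mirror",
        "--page-requisites",
        "--warc-file=" + warc_name,
        url,
    ]
```

The version text could be obtained with `subprocess.run(["wget", "--version"], capture_output=True, text=True).stdout`.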
20:35 🔗 DFJustin for our big multi-user projects we supply a ready-made VM with everything all set up and just a go button to push
20:35 🔗 igelritte okay
20:35 🔗 igelritte um, what are warc files and why use them?
20:36 🔗 DFJustin warc is a standardized format for web archives, it includes all the HTTP response data from the server (not just the file contents) so that you can "play it back" with a proxy and duplicate the original site exactly
20:36 🔗 igelritte You'all are interested in full HTTP headers, or the way back machine?
20:36 🔗 igelritte interesting
20:37 🔗 igelritte very interesting
20:37 🔗 DFJustin the main impetus is that it's a requirement for wayback to integrate the data (proper timestamps are a necessity, for example)
20:37 🔗 igelritte Okay, I can see what you're saying
20:38 🔗 DFJustin everyone grabbed geocities kind of higgledy-piggledy and it's hard to pin down the dates for anything because of filesystems, time zones, modification time vs download time etc
20:39 🔗 DFJustin so the later projects have been standardized on warc
20:39 🔗 igelritte The Geocities project was quite an accomplishment
20:41 🔗 DFJustin warc is big with the pointy-headed academic world because of formal documentation etc. so that gives us an in with that crowd too
20:41 🔗 DFJustin unfortunately the end user tools for it are not great yet
20:43 🔗 igelritte I loved Jason's picture of the datacenter where the nine terabytes were housed. It reminded me of this scene from 'Connections'--that interesting spin on discovery and invention that came out in the 70's by James Burke--where he holds up an old tape cartridge and expounds: "this device holds one million characters," in that tone of voice like the audience is supposed to piss themselves in amazement. You then do the math and realize that
20:43 🔗 joepie91 DFJustin: is there a format specification for warc?
20:43 🔗 joepie91 one that is publicly accessible
20:44 🔗 DFJustin ISO 28500
20:45 🔗 joepie91 CHF 122,00
20:45 🔗 joepie91 eh.
20:46 🔗 joepie91 DFJustin; anything or any place that *doesn't* want to see the inside of my wallet?
20:46 🔗 joepie91 :|
20:46 🔗 DFJustin obviously, you can google it just as well as I can though
20:46 🔗 joepie91 yes, and I only get drafts
20:47 🔗 joepie91 do I seriously have to pirate a document to figure out what warc looks like
20:47 🔗 joepie91 :|
20:47 🔗 igelritte I have to say that you folks seem downright Edwardian in your manners. Most of my experiences in chatrooms with tech-savvy folks have not been so pleasant.
20:48 🔗 SmileyG :D
20:48 🔗 SmileyG Most people suck.
20:48 🔗 SmileyG I think the fact everyone is here because they care about it helps, rather than being here because of "work" or other reasons.
20:49 🔗 DFJustin my suspicion is that the 0.18 draft is the same as the final because international standards move slow but I'll defer that to someone whose head is pointier :)
20:49 🔗 igelritte I was working on Linux from Scratch a few years back; their IRC...well, let's just say that you need a thick skin.
20:49 🔗 alard I believe the bib-something site has a PDF of a draft of the warc spec.
20:49 🔗 alard The warc people at archive.org assured me that that's what they use.
20:49 🔗 igelritte And none of those people were there for work...
20:49 🔗 DFJustin http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
20:50 🔗 SmileyG ah yeah hmm
20:50 🔗 alard That's it. Just replace the version header WARC/0.18 with WARC/1.0, or something.
20:50 🔗 SmileyG igelritte: I've been on "both" sides of the argument
20:50 🔗 alard There are also warc implementation guidelines somewhere.
20:51 🔗 joepie91 alard: the draft is representative?
20:51 🔗 * joepie91 really hates 'standards' that you can't just view
20:52 🔗 alard Yes, I believe so. The Heritrix implementation is based on the same draft, so that's something.
20:52 🔗 igelritte Tell me about it joepie91. I worked in telecom for years. Any idea what they want for a membership to the ITU?
20:52 🔗 alard http://netpreserve.org/publications/WARC_Guidelines_v1.pdf
20:52 🔗 joepie91 igelritte: not sure I even want to know the amount of digits
20:53 🔗 igelritte It's pretty gross
20:53 🔗 joepie91 alard: that 404s
20:53 🔗 joepie91 anyhow, I'll use the bibnum one then
20:54 🔗 alard Does it? I just copied the link I put on the wiki months ago. :)
20:54 🔗 SmileyG http://archiveteam.org/index.php?title=BT_Internet C-, needs work
20:54 🔗 SmileyG :D
20:54 🔗 alard http://www.netpreserve.org/resources/warc-implementation-guidelines-v1
20:54 🔗 alard http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
20:55 🔗 joepie91 thankies
20:55 🔗 alard (It's pretty silly that an "internet preservation consortium" doesn't have stable urls.)
20:55 🔗 DFJustin one of the nice things about WARC though is it's basically human readable, you open it up and bam headers
20:55 🔗 DFJustin so it's reasonably future-proof
20:57 🔗 joepie91 lol alard
21:00 🔗 SmileyG Can't upload images to wiki?
21:00 🔗 igelritte When you watch Jason's presentation at Defcon, you know that other people are involved and that recruits are needed, but the specifics are still a little vague. I guess that I've spent so much time interacting with organizations by being told what to do that the free-for-all comes off as very chaotic. Still not very sure where I can plug in.
21:00 🔗 SmileyG why didn't I see "upload file" ? XD
21:00 🔗 joepie91 hmm, interesting... http://www.webarchivingbucket.com/
21:00 🔗 joepie91 igelritte: link to presentation?
21:01 🔗 igelritte sure
21:02 🔗 DFJustin well our formal projects now are all "run the warrior VM" where we tell your computer exactly what to do
21:02 🔗 joepie91 www.btinternet.com/~catechnology
21:02 🔗 joepie91 www.btinternet.com/~ted.power
21:02 🔗 joepie91 www.dgsgardening.btinternet.co.uk
21:02 🔗 joepie91 www.mstracey.btinternet.co.uk
21:02 🔗 joepie91 cc alard
21:02 🔗 DFJustin it's just that on top of that people have their own archiving side projects that are related to the mission in varying degrees
21:02 🔗 alard joepie91: http://tracker.archiveteam.org/webshots/rescue-me
21:03 🔗 joepie91 alard: webshots?
21:03 🔗 joepie91 shouldn't that be btinternet?
21:03 🔗 alard Oops, sorry, http://tracker.archiveteam.org/btinternet/rescue-me
21:03 🔗 joepie91 :P
21:03 🔗 DFJustin is that expecting urls or user names
21:03 🔗 alard usernames
21:04 🔗 joepie91 0 items added to the queue
21:04 🔗 joepie91 Thanks for your help!
21:04 🔗 joepie91 lol
21:04 🔗 alard Heh.
21:05 🔗 alard The tracker really appreciates your contribution, it just wasn't useful. :)
21:05 🔗 joepie91 haha
21:06 🔗 joepie91 looks like catarc works well :)
21:06 🔗 joepie91 http://sebsauvage.net/paste/?9e695a09848493ea#Yy3GjmiyMI4bfhUcKv9vahutcX48KTJBHLivJh8l2BU=
21:06 🔗 DFJustin nice regex
21:07 🔗 underscor <igelritte> I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
21:07 🔗 underscor I can't remember
21:07 🔗 underscor hahahahahaha
21:07 🔗 underscor was that bluemax?
21:08 🔗 SmileyG what happened with that o_O
21:08 🔗 underscor joepie91: we conform to the draft fyi
21:09 🔗 SmileyG http://archiveteam.org/index.php?title=BT_Internet <<< wtf is with the no description below the imae
21:09 🔗 joepie91 ok, thanks :P
21:09 🔗 SmileyG image
21:09 🔗 underscor we being archive.org
21:09 🔗 underscor SmileyG: lulu poetry's IT department sent a scary letter to him
21:09 🔗 underscor "scary" "letter"
21:10 🔗 SmileyG o
21:11 🔗 joepie91 igelritte: does a video of the defcon presentation exist?
21:11 🔗 joepie91 I can't find it
21:14 🔗 alard SmileyG: The "No description" comes from the image, I think.
21:14 🔗 SmileyG except it has a description :/
21:16 🔗 Dark-Star problems with the archive? I'm getting "rsync: failed to connect to fos.textfiles.com: Connection timed out (110)" all the time
21:16 🔗 alard SmileyG: Oh. Then maybe it's in the template? http://archiveteam.org/index.php?title=Template:Infobox_project&action=edit
21:18 🔗 underscor Dark-Star: it's down atrm
21:18 🔗 underscor atm*
21:19 🔗 Dark-Star ah okay. I'll just leave the Warrior running overnight then. I guess it'll automatically resume the upload later
21:23 🔗 SmileyG alard: ah yeah hmmm :S
21:24 🔗 SmileyG weird because the mobile me one doesn't do it
21:26 🔗 igelritte right on...I'm not as stupid as I originally suspected
21:26 🔗 igelritte GNU Wget 1.14 built on linux-gnu.
21:27 🔗 igelritte I now have the ability to support warc
21:27 🔗 igelritte though, my dpkg still thinks that I'm working with 1.13
21:28 🔗 igelritte It's probably been six months or more since I've compiled and installed anything from scratch. It's funny how quickly you forget that shit.
21:28 🔗 alard igelritte: I don't want to temper your enthusiasm and sense of achievement, but you might want to check if your new Wget includes gzip and SSL support. It's in wget -V, I think.
21:30 🔗 igelritte well, I'm pretty sure that it does because I kept getting an SSL error and had to dig into why and then install libcurl and libgnutls dev packages in order to get wget to compile correctly
21:30 🔗 igelritte but I will check
21:30 🔗 alard Ah good, then it'll probably work.
21:30 🔗 alard soultcer: Starting TinyBack for Item
21:31 🔗 alard (Hint: the git clone is very slow if there's no .git in the repository url: https://github.com/soult/tinyback.git )
21:32 🔗 soultcer It is? Damn, I always felt so clever because I had to type 4 characters less
21:32 🔗 igelritte well, right under the version number, you get the following list: +digest +https +ipv6 +iri +large-file +nls -ntlm +opie +ssl/gnutls
21:32 🔗 soultcer http://tracker.tinyarchive.org/v1/ <-- "ranking"
21:33 🔗 alard soultcer: It's strange, because it does seem to work, but it just takes a long time. I was wondering what my warrior was doing.
21:33 🔗 igelritte I'm not sure about the 'wget-V, I' syntax...is that supposed to be 'wget -V -I'?
21:33 🔗 igelritte or really a comma
21:33 🔗 alard Heh. The comma and I are part of the sentence. :)
21:33 🔗 * igelritte laughs at self
21:34 🔗 primus igelritte: if you're interested in downloading you can download ArchiveTeam Warrior virtual machine - it has everything already set up. http://archive.org/details/archiveteam-warrior
21:35 🔗 alard To check if you have gzip support, use: wget --help | grep warc-compression and see if it returns something. If it does, it works.
21:35 🔗 igelritte I'm a little limited on what I can do with downloading at the moment. This network connection is not really my own.
21:36 🔗 DFJustin <joepie91> igelritte: does a video of the defcon presentation exist? <-- https://www.youtube.com/watch?v=-2ZTmuX3cog
21:38 🔗 igelritte alard: I get the "no-warc-compression"; I'm guessing that warc uses gzip for compression
21:38 🔗 igelritte ?
21:40 🔗 alard Then your Wget is in top condition. The thing with gzip is: you can make .warc and .warc.gz files. It is much better to do the gzip compression in Wget than to do it afterwards. Wget makes a new gzip record for each downloaded file, so it's possible to extract only part of the .warc.gz. If you use the gzip utility to compress your warc afterwards, you can only decompress everything at once.
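A minimal Python sketch of the point alard makes above about one gzip member per record. The two byte strings are stand-ins, not real WARC records, and the offset bookkeeping is an assumption about how an index could be kept, not a description of what Wget itself writes out.

```python
import gzip

# Two stand-in "records"; real WARC records carry headers plus an HTTP payload.
records = [b"record one\n", b"record two\n"]

# Compress each record as its own gzip member and note where each member starts.
members = [gzip.compress(r) for r in records]
offsets = [0, len(members[0])]
per_record = b"".join(members)

# The concatenation is still a valid gzip stream, so reading it all works as usual.
everything = gzip.decompress(per_record)

def read_member(blob, offset, length):
    """Decompress a single gzip member without touching the rest of the file."""
    return gzip.decompress(blob[offset:offset + length])

# Pull out only the second record by its byte offset.
second = read_member(per_record, offsets[1], len(members[1]))

# Compressing the whole file afterwards gives one member: there is no
# record boundary to seek to, so it has to be decompressed from the start.
whole_file = gzip.compress(b"".join(records))
```

The same property is why several .warc.gz files can simply be concatenated and still be read as one file.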
21:43 🔗 igelritte Just performed a quick little test where I ran the following: wget --warc-file test http://en.wikipedia.org/wiki/Jason_Scott_Sadofsky. This seems to have created the 'test' file that I asked for.
21:43 🔗 igelritte -rw-rw-r-- 1 23386 Oct 10 23:41 test.warc.gz
21:44 🔗 joepie91 quick question to alard: how does one write a setup.py where the resulting install package will copy a python file to the bin directory?
21:44 🔗 joepie91 /usr/bin etc
21:44 🔗 alard gunzip -c test.warc.gz to look inside
21:45 🔗 alard Why do you think I would know? I'm a copy-paste setup.py writer. :)
21:45 🔗 alard scripts, I think: https://github.com/ArchiveTeam/seesaw-kit/blob/master/setup.py#L41-44
21:46 🔗 joepie91 well, seesaw does it :P
21:46 🔗 joepie91 and alright, thanks
21:46 🔗 alard I thought you were the python distribution / pip / pypi expert. :)
21:47 🔗 igelritte very interesting. That seems to have worked. I DO have an HTTP document. It doesn't look anything like a wiki, but I'm guessing I know why that is.
21:48 🔗 joepie91 alard: oh, not at all
21:49 🔗 joepie91 I just know how to package up a module with an existing setup.py
21:49 🔗 joepie91 :P
21:49 🔗 joepie91 and that's it
21:57 🔗 igelritte so, when I unpack this archive file (warc) I should expect to find nothing but pure HTTP?
21:57 🔗 alard You'll find warc records, some of which have a HTTP body.
21:58 🔗 igelritte hmmm
21:58 🔗 alard You get some warc headers identifying the record (type, target-uri, timestamp etc.), then the http request or response.
21:58 🔗 alard There are special types of warc records with metadata, such as the wget command line and log.
21:59 🔗 alard So it's not the most user-friendly format, you need to work to get the data out.
21:59 🔗 alard The good thing is that everything is in the file, so you *can* get it out.
22:00 🔗 igelritte This is all just for my education; so, feel free to tell me to fuck off when you lose patience. But, where can I find these headers? When I open the file with a text editor, it appears to be just HTML.
22:01 🔗 alard You'll have to look better then, they're in there.
22:01 🔗 alard It starts with WARC/1.0 or something, then there's WARC-Target-URI, etc.
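To make the record layout alard describes concrete, here is a rough Python sketch that walks an uncompressed WARC byte string: a version line, named headers, a blank line, then Content-Length bytes of body. The sample records are hand-built for illustration and leave out fields a real file carries (such as the record ID and date); `make_record` and `parse_warc_records` are hypothetical helper names, not part of any WARC library.

```python
def make_record(rec_type, uri, body):
    """Build a simplified, made-up WARC record for demonstration only."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is terminated by two CRLFs after its body.
    return head + body + b"\r\n\r\n"

def parse_warc_records(data):
    """Return a list of (version, headers, body) tuples from raw WARC bytes."""
    pos = 0
    records = []
    while pos < len(data):
        # The WARC headers end at the first blank line.
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        version = lines[0]
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says how many body bytes follow the blank line.
        length = int(headers["Content-Length"])
        body_start = head_end + 4
        body = data[body_start:body_start + length]
        records.append((version, headers, body))
        # Skip the body and the two CRLFs that close the record.
        pos = body_start + length + 4
    return records

sample = (
    make_record("request", "http://example.com/",
                b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    + make_record("response", "http://example.com/",
                  b"HTTP/1.1 200 OK\r\n\r\nhello")
)
records = parse_warc_records(sample)
for version, headers, body in records:
    print(version, headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

For a .warc.gz, the bytes would first need to be decompressed, for instance with `gzip.decompress`.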
22:04 🔗 SketchCow Hey, so my commentary before.
22:06 🔗 alard It has scrolled away. :)
22:06 🔗 DFJustin SketchCow: http://archive.org/details/archiveteam-city-of-heroes-www is not on the list
22:07 🔗 igelritte craziness...I just used vi on the test.warc.gz file and the headers you mentioned showed up. Vi also showed me all the compressed content. I didn't know that vi could do that...
22:07 🔗 SmileyG SketchCow: geocities - there's a dump on the ia but I can't find it anymore (and it was searchable.... we really need to make those links more accessible...)
22:08 🔗 alard http://archive.org/details/archiveteam-qaudio-rescue
22:08 🔗 alard http://archive.org/details/archiveteam-cinch
22:08 🔗 joepie91 wait wait wait wait, what? Jeroenz0r is/was part of urlteam?
22:09 🔗 SketchCow Only WARC items. So Geocities precedes that.
22:10 🔗 SmileyG ah k
22:12 🔗 igelritte Perhaps I'm really thick here...and that wouldn't be a surprise...but I'm still not seeing how I can contribute. Is there a list of "shit that needs to get done and we'd be thrilled if you'd take it on" somewhere?
22:12 🔗 SketchCow Both added, alard
22:12 🔗 SketchCow What's your skillset, igelritte?
22:12 🔗 DFJustin various godane grabs(tm) at https://archive.org/search.php?query=warc%20uploader%3A%22slaxemulator%40gmail.com%22
22:13 🔗 alard There are some groklaw.net warcs: http://archive.org/details/groklaw.net-pdfs-2004-20120827
22:13 🔗 igelritte Well, I've done some BASH scripts. I'm trilingual. I've done lots of networking.
22:13 🔗 igelritte And there's a bunch of voip in there too
22:13 🔗 alard http://archive.org/search.php?query=groklaw%20warc
22:14 🔗 joepie91 igelritte: is there any chance you can turn the install script for the webshots script, into something more sane?
22:14 🔗 joepie91 because I suck at bash :P
22:14 🔗 igelritte I'm not that awesome at it either, but I can look at it.
22:14 🔗 joepie91 current script is at http://cryto.net/projects/webshots/webshots_debian.sh
22:14 🔗 alard http://archive.org/search.php?query=warc%20journalstar (but it's getting more obscure now)
22:14 🔗 joepie91 thanks :)
22:16 🔗 igelritte Hmmm...
22:16 🔗 nintendud joepie91: you can set a trap on error to avoid all the conditionals
22:16 🔗 igelritte this could use some commenting and perhaps a header
22:16 🔗 nintendud and then have it print "Error on line x". Not as nice of a message though.
22:17 🔗 igelritte who wrote this? And why are they doing an apt-get at the beginning?
22:17 🔗 joepie91 igelritte: I did
22:17 🔗 joepie91 and the apt-get is to install dependencies
22:18 🔗 alard http://archive.org/search.php?query=uploader%3A%28slaxemulator%40gmail.com%29%20AND%20warc
22:18 🔗 DFJustin is there an echo in here
22:18 🔗 igelritte I think I see what you're doing here, and I understand why you would do an apt-get update before doing an install
22:18 🔗 alard DFJustin: Oh, sorry. :)
22:18 🔗 igelritte but, I don't think I understand enough of the purpose here to understand why you would do that in a script
22:19 🔗 joepie91 igelritte: it's apt-get update, not upgrade
22:19 🔗 joepie91 just updates the package list
22:19 🔗 igelritte I'm guessing that my ignorance is to blame
22:19 🔗 igelritte right
22:19 🔗 nintendud joepie91 / igelritte: here's a nice article on BASH traps, btw. http://phaq.phunsites.net/2010/11/22/trap-errors-exit-codes-and-line-numbers-within-a-bash-script/
22:19 🔗 igelritte typo on my part
22:19 🔗 joepie91 had it break for some people because the package lists weren't up to date, so that's why update is there :)
22:21 🔗 nintendud joepie91: also, why are you using useradd? On Debian, you're supposed to use the adduser command afaik
22:21 🔗 joepie91 adduser is interactive
22:21 🔗 nintendud Doesn't have to be
22:21 🔗 nintendud At least, I think you can make it a one-liner
22:21 🔗 joepie91 iirc I haven't found a way to make it not interactive
22:21 🔗 joepie91 :P
22:22 🔗 joepie91 anyway, any particular reason not to use useradd?
22:22 🔗 nintendud Does useradd make the home directory?
22:22 🔗 joepie91 yes
22:22 🔗 nintendud o
22:22 🔗 nintendud Welp, adduser just follows a nice configuration file that specifies things like the permissions to set on the home directory among other things
22:23 🔗 nintendud But I guess useradd works OK. I was just curious. :-)
22:36 🔗 DFJustin SketchCow: there are more qaudio items, http://archive.org/details/archiveteam-qaudio-archive-1 through http://archive.org/details/archiveteam-qaudio-archive-7
22:39 🔗 DFJustin also fan fiction http://archive.org/search.php?query=%22fan%20fiction%22%20archiveteam
22:41 🔗 joepie91 right
22:41 🔗 joepie91 pip install catarc
22:41 🔗 joepie91 :)
22:41 🔗 joepie91 cat for archives
22:48 🔗 SketchCow OK, so I got out of a meeting about incorporating archive team stuff into wayback
22:48 🔗 SketchCow NATURALLY it's slightly more complicated in some cases.
22:49 🔗 SketchCow Let me make some changes to the thing.
22:52 🔗 chronomex of course it is
22:52 🔗 chronomex what kind of changes do they want?
22:55 🔗 SketchCow Look at the document again. All green ones are cleared for takeoff.
22:55 🔗 chronomex wow, awesome
22:57 🔗 chronomex so looks like they can just suck in warc-in-nothing, yes?
22:59 🔗 SketchCow Yes
22:59 🔗 SketchCow They cannot suck in warc-in-archives
22:59 🔗 SketchCow So, next step is to look at the archives ones and see if there's not too many WARCs in it, say less than 100
22:59 🔗 chronomex I mean "just suck in" as in "point the ingestor at"
22:59 🔗 DFJustin good thing we didn't upload 250tb of that XD
23:00 🔗 chronomex lol yes
23:01 🔗 chronomex mobileme: 280T of .tar containing .warc.gz
23:01 🔗 chronomex soooo
23:02 🔗 SketchCow We're aware of it and there'll be a project to deal with that.
23:02 🔗 SketchCow But I don't want to rush it.
23:02 🔗 SketchCow So Brewster's letting me make doubled files for weird ones.
23:06 🔗 DFJustin even if there's a shitload of warcs inside they can all be cat-ed together into one megawarc right
23:07 🔗 arkhive is there a webshots tracker I can check the progress? (I'm unable to help, I'm just curious how it's going)
23:07 🔗 DFJustin http://tracker.archiveteam.org/webshots/
23:07 🔗 arkhive thank you :)
23:08 🔗 DFJustin underscor making his isp cry again
23:08 🔗 SketchCow Yeah, but the machine is still down
23:08 🔗 SketchCow so I don't know what's going on
23:08 🔗 SketchCow DFJustin: Yes, exactly.
23:13 🔗 joepie91 alard: what about an 'assorted' warrior project
23:13 🔗 joepie91 with things that are small or heavily rate-limited (like some urlteam targets)
23:13 🔗 joepie91 that the warrior automatically switches to whenever it has nothing else to do
23:14 🔗 chronomex that sounds cool.
23:14 🔗 joepie91 for example, if the current selected project is done
23:14 🔗 joepie91 a "let's not waste any time or bandwidth that we have" mode, so to say :P
23:14 🔗 chronomex urlteam is a basically-no-bandwidth project, it might actually make more sense to run it in the background always.
23:15 🔗 joepie91 maybe have an 'always running' *and* 'assorted' project
23:15 🔗 chronomex yeah
23:15 🔗 joepie91 separate projects... one always runs, like urlteam
23:15 🔗 joepie91 and assorted is filled with whatever small project is happening that doesn't warrant its own separate project, really
23:15 🔗 chronomex 'assorted' would be filler for "let archiveteam choose"
23:15 🔗 joepie91 as a fallback when it has nothing better to do
23:15 🔗 joepie91 well yes, but the thing is
23:16 🔗 joepie91 say that I've got it configured for btinternet
23:16 🔗 joepie91 the moment btinternet is done, which will be soon
23:16 🔗 joepie91 my warrior will be bored out of its skull, no?
23:17 🔗 chronomex yes
23:18 🔗 joepie91 would be good if it switched to 'assorted' then :P
23:18 🔗 joepie91 'let archiveteam choose' has a pretty different function
23:18 🔗 joepie91 that option should always refer to the most urgent project
23:18 🔗 joepie91 such as, in this case, webshots
23:18 🔗 joepie91 assorted would have the stuff that isn't really urgent or significant, but has to be done anyway
23:18 🔗 joepie91 at some point in time
23:21 🔗 chronomex ah
23:36 🔗 flaushy hi, is fos.textfiles.com down?
23:36 🔗 joepie91 it is
23:37 🔗 flaushy rsync will happily retry until it reappears, right?
23:37 🔗 joepie91 if I recall correctly, it will retry 50 times
23:37 🔗 joepie91 before giving up
23:37 🔗 joepie91 alard can probably confirm on that
23:37 🔗 flaushy :( 50k link user in queue
23:37 🔗 SketchCow Fortress of Solitude is Back
23:37 🔗 joepie91 ouch
23:37 🔗 joepie91 oh, it is?
23:38 🔗 joepie91 SketchCow: my warrior disagrees
23:38 🔗 joepie91 rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
23:38 🔗 flaushy same here, but i guess it will work soon then :)
23:39 🔗 flaushy probably we are just hammering it currently
23:39 🔗 flaushy and thx for the info!
23:39 🔗 joepie91 aaaaand there it went
23:39 🔗 joepie91 :D
23:41 🔗 SketchCow Hooray, 517 rsync connections.
23:41 🔗 joepie91 lol
23:41 🔗 flaushy working for me now too :)
23:42 🔗 joepie91 :|
23:42 🔗 joepie91 uploads just died
23:42 🔗 joepie91 like, literally flatlined
23:42 🔗 joepie91 ah, it resumed
23:42 🔗 joepie91 and flatlined again
23:42 🔗 joepie91 wat
23:43 🔗 DFJustin alard: you wanna run through the usernames in these https://en.wikipedia.org/wiki/Wikipedia:Bot_requests#btinternet
23:43 🔗 igelritte so, from the following, I can assume that fos = fortress of solitude and that this is some place where folks are trying to rsync there current downloads to. Feel free to direct me to a link that will shut me up.
23:43 🔗 igelritte *thier
23:43 🔗 igelritte or maybe their
23:44 🔗 joepie91 igelritte: yes, fos is where the uploads go
23:44 🔗 igelritte At some point grammar will come bck to me
23:44 🔗 chronomex until then
23:44 🔗 igelritte indeed
23:46 🔗 flaushy phew... seems like some 1 gb stuff is in queue on nooon
23:50 🔗 joepie91 DFJustin: http://pastie.org/5032511
23:51 🔗 joepie91 is the clean version
23:51 🔗 joepie91 of all usernames for both .com and .co.uk
23:51 🔗 joepie91 sorted, unique
23:51 🔗 joepie91 also cc alard, idk if that list is already in the tracker
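A hedged Python sketch of how a "sorted, unique" username list like this could be built from raw links. It assumes the two URL shapes pasted earlier in the channel (a `~name` path and a `name.` subdomain) and assumes both can occur on either domain; the actual script behind the pastie is not shown in the log, and `extract_usernames` is a hypothetical name.

```python
import re

# Assumed shape 1: www.btinternet.com/~username
TILDE = re.compile(r"btinternet\.(?:com|co\.uk)/~([A-Za-z0-9._-]+)", re.I)
# Assumed shape 2: www.username.btinternet.co.uk
SUBDOMAIN = re.compile(
    r"(?:www\.)?([A-Za-z0-9_-]+)\.btinternet\.(?:com|co\.uk)", re.I
)

def extract_usernames(text):
    """Collect btinternet usernames from free text, sorted and de-duplicated."""
    found = set()
    for match in TILDE.finditer(text):
        # Drop a trailing full stop picked up from surrounding prose.
        found.add(match.group(1).rstrip("."))
    for match in SUBDOMAIN.finditer(text):
        name = match.group(1)
        # A bare "www.btinternet.com" host is not a username.
        if name.lower() != "www":
            found.add(name)
    return sorted(found)

links = """
www.btinternet.com/~catechnology
www.btinternet.com/~ted.power
www.dgsgardening.btinternet.co.uk
www.mstracey.btinternet.co.uk
www.btinternet.com/~catechnology
"""
usernames = extract_usernames(links)
print("\n".join(usernames))
```

Usernames are kept exactly as written, since the log does not say whether the site treats them as case-sensitive.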
23:51 🔗 joepie91 k, time to sleep
23:51 🔗 joepie91 goodnight all :)
23:51 🔗 DFJustin nice thanks
23:59 🔗 SketchCow Well, FOS is getting CRUSHED, we'll see how long this lasts.
23:59 🔗 SketchCow 848 rsync connections
23:59 🔗 nintendud lol
