[00:09] WE'LL FIND OUT
[00:24] SketchCow: I'm grabbing all of the official xbox magazine podcasts
[00:24] there are like 311 podcasts
[00:25] i'm uploading the rest of the no bs podcast now too
[02:49] so, I've got all the laptop service manuals from dell's ftp - does someone have a place I can upload them to?
[11:37] alard: is btinternet a warrior project yet?
[11:38] Yes, it's more or less ready (barring any new insights) but it's not actually on the warrior.
[11:39] https://github.com/ArchiveTeam/btinternet-grab
[11:39] Is ready to go.
[11:39] (Almost.)
[11:40] Why?
[11:42] well, when it's done, my warrior has something important to do :P
[11:43] We should keep looking for more usernames, though.
[11:43] I added the sites from DMOZ, from the wayback machine, and am waiting for the btinternet links on tvtropes.org.
[11:45] alright
[11:49] I'm now downloading the wikipedia dump as well.
[11:50] wikipedia dump? as in, find btinternet links on wikipedia?
[11:50] speaking of which.. I'll have a look in the stackexchange dump
[11:50] I have it here locally
[11:53] joepie91: Yes, bunzip2 | grep ...
[11:54] It seems that there are a few links on Wikipedia: https://encrypted.google.com/search?hl=en&q=site%3Awikipedia.org%20btinternet.co.uk
[11:55] oh goddamnit, I removed the stackexchange data dump a few days ago
[11:55] redownload time
[11:59] alard: I think the "All Projects" tab in the warrior should be "Choose Project"?
[12:00] SmileyG: Perhaps. But "Choose" is a verb. "Settings" is not. Is "Available projects" a solution?
[12:01] Yeah, that works
[12:01] Currently I'd think "All Projects" would select all projects...... make sense?
[12:03] Yes, I think I understand your point. (Although you could also say that it's a tab, not a button, so it shows you "all projects", like it does.)
[12:03] Hehe
[12:03] UI design is hard :P
[12:04] Well, I have a habit of reading things differently to others, but I was good at it at uni. :S
[12:04] It's fun.
[12:15] http://tracker.archiveteam.org/btinternet/
[12:16] (Don't go too fast.)
[12:16] thanks for reminding me to see how webshots was doing :P
[12:16] underscor with 2364GB.
[12:16] I'm going to kill him one day
[12:19] why does it say only 8 items done so far? :P
[12:19] oh I see...
[12:19] nvm :P
[12:21] balrog_: You could be number 1 with 9!
[12:22] alard: do I have to use the warrior?
[12:22] :|
[12:23] What's wrong with the warrior? It's a small project.
[12:23] it takes up more ram and cpu on my side :/
[12:23] It's pretty minimal how much it takes up.
[12:24] BlueMaxim: not exactly
[12:24] it uses up to 20% of my 4GB of RAM
[12:24] how long til bt dies?
[12:24] usually around 13
[12:24] joepie91, seriously? I thought it only needed 256MB of RAM
[12:24] BlueMax: that's the VM itself - apparently virtualbox adds a bunch of overhead on top of that
[12:24] also
[12:24] or something
[12:24] it's quite heavy on CPU
[12:24] on my shitty notebook i3
[12:25] 2 x 1.3GHz
[12:25] hm, these are syncing like 6kb
[12:25] I guess if there is only one page
[12:26] Oh, 404 error, even smaller
[12:26] guess I didn't notice
[12:27] My computer must be better at this than I thought :P
[12:28] can the tracker be more verbose than 0MB?
[12:30] I ran into virtualbox using 7 gigabytes of ram before it got OOM-killed
[12:30] While running the warrior a few days back
[12:31] lots of 0MBs
[12:31] lol
[12:31] memory leak to the max :P
[12:32] alard: when should it start new processes? :S I've got it set to 6 but it still only shows 4
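A concrete form of the bunzip2 | grep approach mentioned above might look like the following (a sketch only: the dump filename and the exact pattern are assumptions, not what was actually run):

    bunzip2 -c enwiki-latest-pages-articles.xml.bz2 \
        | grep -oP '([a-zA-Z0-9.-]+\.)?btinternet\.(com|co\.uk)(/~[^/"%?& ]+)?' \
        | sort -u > btinternet-candidates.txt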
[12:33] SmileyG: When an item finishes, the warrior checks the number to see how many new items there should be.
[12:33] hmmm k
[12:33] one's just finished, let's see if it works this time
[12:33] Also, I've changed to BT but the banner still shows webshots (I presume because some of the jobs are still webshots).
[12:34] have there ever been archiving/warrior projects where the warriors were throttled/rate-limited/blocked?
[12:34] SmileyG: it will first finish the webshots jobs
[12:34] then move on to BT
[12:34] I rate limit mine joepie91 :P
[12:34] oooo, 39MB
[12:34] The warrior can't run multiple projects at the same time, so yes, it waits for webshots to complete.
[12:35] ok, makes sense :D
[12:35] (Also: why not keep it on webshots? I expect btinternet won't take long.)
[12:35] it'd be cool if it could multitask
[12:35] one process on one project, four on another
[12:35] I have a webshots running at work on 5Mbit, this is amazingly slow compared to that ;)
[12:42] alard: http://www.quickonlinetips.com/archives/2012/09/google-feedburner-shutting-down/
[12:43] not sure if there's any useful data on feedburner
[12:43] but it sure looks like signs of imminent death
[12:43] also http://searchenginewatch.com/article/2213759/Google-Shutting-Down-AdSense-for-Feeds-Classic-Plus-More-Services?utm_source=twitterfeed&utm_medium=twitter
[12:43] Isn't that just a proxy/cache/stats service?
[12:43] Yeah, it is a stats tracking service for RSS feeds
[12:44] So thousands of RSS feeds will break
[12:44] but they don't really host much data
[12:44] this may also be a problem for THQ-related sites: http://www.gamearena.com.au/news/read.php/5116588
[12:44] THQ Asia Pacific shutting down
[12:44] i got to grab my t3 magazine podcast then
[12:45] are there any THQ Asia Pacific-run sites that have user content?
[12:45] looking now
[12:46] Cameron_D: links to a lot of podcasts and stuff could be lost
[12:47] http://feeds.feedburner.com/T3/podcast
[12:48] feedburner just acts as a proxy though (to collect stats)
[12:48] Somewhere on the t3 site is the actual feed
[12:48] At least that is how I remember it working
[12:49] but that feed, i think, doesn't go back that far
[12:50] Cameron_D: also as an aggregator afaik
[12:50] their only feed is from feedburner
[13:07] the warrior image has issues
[13:08] first off, vmware complains that it doesn't meet ova specs
[13:08] second, I get an error that there's an ide slave with no master
[13:08] balrog_: Which image?
[13:09] 20121008?
[13:09] archiveteam-warrior-v2-20121008
[13:09] yes
[13:09] http://dmorton.staff.hostgator.com/archiveteam-warrior-vmware.ova is vmware-compatible (albeit an older version)
[13:09] why did this one break?
[13:10] I don't know about the ova specs. There previously was a problem with the filename. I had exported the image as archiveteam-warrior-v2.ova, and then renamed it to include the date. This new image is exported with the correct name.
[13:10] And the IDE slave with no master, that seems to be a virtualbox - vmware incompatibility.
[13:10] The import failed because /path/to/archiveteam-warrior-v2-20121008.ova did not pass OVF specification conformance or virtual hardware compliance checks. Click Retry to relax OVF specification and virtual hardware compliance checks and try the import again, or click Cancel to cancel the import. If you retry the import, you might not be able to use the virtual machine in VMware Fusion.
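For what it's worth, an .ova is just a tar archive containing an .ovf descriptor, a manifest of SHA1 digests, and the disk images, so a non-conforming export can in principle be unpacked, corrected and repacked by hand (the filenames below are assumptions based on the image name):

    tar tf archiveteam-warrior-v2-20121008.ova    # should list the .ovf, .mf and .vmdk files
    tar xf archiveteam-warrior-v2-20121008.ova
    # after editing the .ovf descriptor (e.g. the IDE controller layout), rebuild the manifest:
    sha1sum *.ovf *.vmdk | awk '{print "SHA1(" $2 ")= " $1}' > archiveteam-warrior-v2.mf
    # the .ovf must be the first member of the archive:
    tar cf warrior-fixed.ova archiveteam-warrior-v2.ovf archiveteam-warrior-v2.mf *.vmdk

VMware's OVF Tool (mentioned below) can alternatively convert such an image with relaxed conformance checks via its --lax option.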
[13:11] I've added two disks in VirtualBox, but for some reason VMware ends up with two controllers: 1-master for disk 1, 2-slave for disk 2.
[13:11] and then ... There is an IDE slave with no master at ide1:1. This configuration does not work correctly in virtual machines. Move the disk/CD-ROM from ide1:1 to ide1:0 using the configuration editor.
[13:12] I wouldn't be surprised if VBox is malforming the ova
[13:12] VBox is unfortunately full of bugs
[13:13] heh, ESXi still rejects the file too http://i.imgur.com/z3Kox.png
[13:14] hm, they have an OVF tool
[13:16] balrog_
[13:16] are you running vmware workstation?
[13:16] no, fusion
[13:16] which is basically the mac version of workstation
[13:17] when i first imported archiveteam-warrior-v2-20120813 i got the error about it not being valid. then i just imported again and it worked.
[13:17] i got the ide error as well after that too
[13:17] yeah, but I keep getting the ide error
[13:17] you just have to go into the settings and change the second drive to ide0:1
[13:17] from ide1:0
[13:23] hmm
[13:23] what if someone imported the vm into vmware, fixed it, and exported it?
[13:23] I wonder if the ova file would be more up-to-spec
[13:25] you'd probably want to export as a vmdk or whatever the vmware equivalent is. you can always just rar up the vmdk files, and if someone uses them vmware will just ask if they copied it
[13:25] alard: btinternet\.(com|co\.uk)
[13:25] right?
[13:25] ova is better if it's compatible
[13:25] err, compliant
[13:25] apparently vbox doesn't produce compliant files
[13:26] bingo
[13:26] http://www.btinternet.com/~se16/hgb/statjoke.htm
[13:26] se16 :P
[13:27] uploaded: http://archive.org/details/cdrom-linuxformatmagazine-76
[13:27] joepie91: Yes, and then www\.(.+)\.btinternet or /~([^%?/]+)
[13:28] The final webshots rsync finishes in a few min and then bt :D
[13:29] alard: I've also seen a few *without* www in front
[13:29] and just the username
[13:31] alard: 7z e -so *.7z | grep -P "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
[13:31] :)
[13:31] it will take a few hours for the torrent to finish downloading
[13:31] after that, that will yield all the relevant entries
[13:36] better:
[13:36] 7z e -so *.7z 2> /dev/null | grep -Po "(([^\s(/]+)\.)?btinternet\.(com|co\.uk)(\/~([^/ %?]+))?"
[13:57] how well does the warrior handle a network connection change?
[14:01] how well does the warrior handle a network connection change?
[14:01] also, why no rsync with continue?
[14:05] balrog_: it should back off, then continue once it figures it out
[14:06] you mean with the wget?
[14:06] rsync seems to lack continue though...
[14:08] Doesn't --partial-dir enable --partial?
[14:08] (Just rsync --partial is dangerous in this case, since SketchCow will move any file in the upload directory.)
[14:22] Hey there, if you see my name on an uncompleted webshots job, please release the lock.
[14:25] willwill: No problem. (There will probably be other failed jobs, so I'll requeue them all at once later.)
[14:46] balrog_: rsync, continue?
[14:46] rsync knows what it's sent, and it doesn't require continue
[14:46] resume, rather
[14:47] --partial or -P switch
[14:47] it doesn't need it....
[14:47] partial does partial files
[14:48] rsync checks each file as it goes
[14:48] yeah, well, a single .warc is pretty large
[14:48] and if it gets interrupted, the whole thing has to start over
[14:48] yeah true, then you're screwed :S
[14:52] I've added --partial to btinternet, so the next project will have it too.
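The resume behaviour under discussion: with --partial and --partial-dir together, an interrupted transfer keeps its partial file in a private subdirectory and picks up from there on the next attempt, instead of restarting the whole .warc. A minimal sketch (the module and path are made up):

    rsync -av --partial --partial-dir=.rsync-tmp \
        someuser.warc.gz fos.textfiles.com::webshots/incoming/

Because the partial file lives under .rsync-tmp/ rather than in the upload directory itself, a half-finished .warc can't be swept up by whatever moves completed files on the receiving end.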
[14:52] Isn't that going to cause issues, as you highlighted earlier?
[14:52] No, because --partial-dir keeps the partial files in a separate directory.
[14:53] They're uploaded to the .rsync-tmp/ subdirectory and moved when the upload completes.
[14:54] I thought --partial-dir would be enough, but apparently you need --partial too.
[14:55] oooo
[14:55] heh, that's random devs for you
[14:59] alard: the title in the btinternet pipeline.py is still webshots
[14:59] ;)
[15:02] I see. And apparently the title isn't used anywhere.
[15:03] Wikipedia produced 933 new btinternet names.
[15:04] :D
[15:04] I'm searching math stackexchange now
[15:04] wikipedia? :o
[15:04] alard: stats stackexchange produced "se16" as the only username
[15:06] it's referenced a *lot* on math. as well
[15:06] seems like a pretty important site
[15:06] ha
[15:06] Think twice before using BT as an ISP.
[15:06] on the homepage of that site
[15:06] BT used to provide its internet subscribers with a small amount of personal webspace, but did not promote the service, so only the oldest, most loyal customers used it. Now it no longer wishes to satisfy these customers and is closing the service down. So this page and others of mine, which have received over 2 million hits in 13 years, have to move.
[15:06] If your browser does not automatically go to http://www.se16.info/index.htm within a few seconds, you may want to go to the destination manually.
[15:06] My conclusion is that if you ever consider BT as a possible ISP for some reason, you should not expect that reason to last.
[15:07] yah
[15:09] joepie91: We already had it. :) Processed items: 1, added to main queue: 0
[15:12] alright :P
[15:12] brb
[15:14] alard: Quick question about the warrior: if there are multiple warcs waiting to upload, how does it decide which one goes next?
[15:15] LIFO, I think, but if you really want to know you should check here: https://github.com/ArchiveTeam/seesaw-kit/blob/master/seesaw/task.py#L72-107
[15:17] I... have no idea what I'm looking at.
[15:18] But since it looks like array manipulation, I'm guessing my request to do smallest-file-first is a no-go.
[15:19] That would be hard, I think. Then the queueing thing would have to know about file sizes.
[15:19] And does it really matter?
[15:19] Kinda-maybe. It'd free up more threads to download quicker.
[15:20] As it is, there are times when all my worker threads are waiting for one upload to finish so they can go.
[15:20] Of course then you'd have a problem with large files never uploading, but you could conceivably have that with LIFO as well, and I haven't seen it happen yet.
[15:22] Maybe the upload limit should just go.
[15:23] Some people wanted it in the previous warrior.
[15:23] I limit the VM, shrug.
[15:23] Upload limit, as in throughput, or as in waiting turns?
[15:24] Waiting turns. I think the thinking then was that one rsync uploads faster, so it can start downloading sooner.
[15:24] The opposite of what you say now, basically. :)
[15:24] I can kinda see that, since the overhead of switching wouldn't help overall.
[15:24] wasn't it because the upload location was really slow at one point?
[15:24] and no one could finish anything :D
[15:24] ended up eating all the space on the warriors.
[15:25] Is there someplace I can set it to let 2 upload at once, to see if there are any wins to be had that way?
[15:26] yup
[15:26] you running the vm?
[15:26] I have up to 6 uploads at once.
[15:26] Yes.
[15:26] ok, on the vm window
[15:26] alt+F3
[15:26] OK, log in to the VM. Got that.
[15:26] nano -w /home/warrior/projects/webshots/pipeline.py
[15:27] ctrl+w
[15:27] (Well, I will have that about 6:00 tonight. Can't access the VM from work :) )
[15:27] Ah, ok
[15:27] I need to do a page on this on the wiki
[15:27] But keep going. I'll check the scrollback tonight.
[15:29] alard: Dunno what project it was requested for, but webshots may just be a different critter. Large variation in upload sizes. Waiting is probably still good, we just might want to be smarter about the criteria for deciding who's next :)
[15:29] But the current warrior wins on simplicity.
[15:29] Is it worth removing the limit?
[15:29] type LimitConcurrent and hit enter, and change the 1 to 6 (or whatever figure)
[15:29] (At least, I think it does. I can read Python about as well as I can read Japanese. (Not at all.))
[15:30] I'll try mine tonight. It may let smaller files squeak out, but it may also take longer because of drive-spinning at either end.
[15:32] Word of caution: if you change the pipeline.py in your warrior, you may break future updates. (If git can't figure out how to apply the update to your modified version.)
[15:32] heh, i seem to have broken it anyway ¬_¬
[15:32] still getting no output
[15:33] Stop the project, go into your warrior and use git pull to figure out what's wrong?
[15:33] Understood. But define "break". Update won't apply, warrior will conk out, house burns down, what?
[15:33] I think you can expect the SmileyG problem.
[15:34] Ah.
[15:34] webserver runs, nothing else does :D
[15:34] So you'll have to log in, use git pull to figure out what's going wrong.
[15:34] And as we're talking about it, my 261-meg user finishes :)
[15:35] alard, would it work to just delete the project and restart the warrior?
[15:35] alard: I'd vote to keep the limit, but add an option to change it.
[15:35] SmileyG: Is that worth stopping every warrior? (That's what happens if I push an update. Every warrior will finish its current task and restart the project.)
[15:36] primus: That would work.
[15:36] alard: can't you just do the update and let them pull it in time?
[15:36] Yeah, restarting warriors on this project I think is worse.
[15:36] Define "in time"?
[15:36] whenever they restart their vm?
[15:36] No. They check for updates on github.
[15:36] Also, add a "Check for updates" button to the settings page?
[15:36] Heh. Like Windows Update. "Updates to this warrior are now available. Apply? This may require your warrior to restart."
[15:36] lol
[15:37] where do I run the git pull?
[15:37] What we should have, in a future version, is a gradual update.
[15:37] cd /home/warrior/projects/$project/
[15:37] (perhaps su -u warrior first)
[15:38] hmmm, it's moaning about the changes in pipeline
[15:39] * SmileyG changes it back and git pulls
[15:39] It'd probably be an awful bitch, but would the multiple-project idea be useful for that? So /home/warrior/projects/$project.$version instead? Let one run out while the new one sees threads disappear and spins up?
[15:40] alard: ok, I see the new rsync code...
[15:40] need to restart the warrior for the web interface to update?
[15:41] or is it only set via the code (and won't this then cause git to explode again?)
[15:41] :O
[15:41] IT'S GONE CRAZY
[15:41] 15 users and counting on one screen
[15:43] There we go...
[15:43] that is bonkers when it first starts up
[15:43] you just see hundreds of boxes popping up
[15:44] alard: I remember - the script to create the 50GB tars couldn't keep up for FortuneCity, that's why the rsync got limited.
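Putting those steps together, the concurrency change and the later cleanup could look like this inside the VM (the exact LimitConcurrent(1, ...) text is hypothetical; check what pipeline.py actually contains before editing):

    # alt+F3, log in, then:
    cd /home/warrior/projects/webshots
    sed -i 's/LimitConcurrent(1,/LimitConcurrent(6,/' pipeline.py   # or edit it with nano -w

    # if a later update refuses to apply over the local edit:
    git checkout -- pipeline.py   # discard the local change
    git pull                      # the update now applies cleanly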
[15:54] DoubleJ: Yes, that's similar. (I was thinking it might be better to have the cloned git repo in /home/warrior/projects/$project, as the most up-to-date version, then do a clone to /data/projects/$project.$version before starting a project.)
[16:37] Have we killed fos?
[16:38] :O
[16:39] 2Kb/s! \o/
[16:39] Oh, it's coming back now
[16:40] Planned Delivery Date
[16:40] Wednesday 10th October
[16:40] Planned Delivery Time
[16:40] Between 07:30 and 17:30
[16:40] Wed Oct 10 17:40:33 BST 2012
[16:40] HERP?
[17:08] HEY
[17:08] yeah, the uploads are totally dead?
[17:08] primus
[17:08] :(
[17:08] you've overtaken me
[17:08] SmileyG: ?
[17:08] 4587520 39% 12.21kB/s 0:09:45
[17:08] [sender] io timeout after 300 seconds -- exiting
[17:09] sec
[17:09] wtf, mine is dead
[17:09] Retrying RsyncUpload for Item jpr.tree after 30 seconds...
[17:13] .... brokeyd :D
[17:13] alard: did you break something :(
[17:21] my rsyncs are dying..
[17:21] rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
[17:21] Process RsyncUpload returned exit code 10 for Item andrewjjstanley
[17:21] Retrying RsyncUpload for Item andrewjjstanley after 30 seconds...
[17:21] rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.7]
[17:22] yah
[17:22] :<
[17:23] they retry, but still, it's killed all progress :<
[17:23] oh
[17:23] they run now
[17:24] http://isup.me/fos.textfiles.com
[17:26] I think this is a SketchCow problem.
[17:27] :<
[17:27] (The warriors will retry 50 times with 30-second pauses before they fail.)
[17:28] :< herp.
[17:34] alard: it responds to ping
[17:46] alard: se16 0MB << hey, look :D
[18:21] SmileyG: mmm
[18:21] it's probably because he replaced the index page
[18:22] joepie91: yeah, I figured it might be that.
[18:22] well, it makes sense, the script forwards you off-site.
[18:41] fos is currently down-ish
[18:41] fyi
[18:41] ish
[18:41] how can a box be down-ish
[18:42] He's mincing words.
[18:42] it still pings
[18:42] It's down.
[18:42] It's superdown.
[18:42] VMs at archive have 3 states: up, no ssh/services, and no ping
[18:43] anyway, yeah, it's turbofucked
[18:46] how does tpb fetch Google Books' stuff? does it accept suggestions? http://lists.wikimedia.org/pipermail/wikisource-l/2012-October/001204.html
[18:49] wait
[18:49] how is rsync still working if fos is down :O
[19:13] OKAY HI
[19:13] NEED HELP
[19:14] https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
[19:14] OK, that's a listing of all archiveteam projects on archive.org.
[19:14] 1. Please see if I missed any.
[19:15] (i.e. just browse through the archiveteam set to see)
[19:28] haha, I love the item counts
[19:28] 26, 70, 29, 3956
[19:35] is IA down? not working for me.
[19:39] it's not working for me either
[19:39] k
[19:42] SketchCow: you missed the most famous of all - geocities.
[19:45] heh
[19:45] okay, maybe a recursive grep through my entire repository folder was a bad idea
[19:46] Geocities isn't warc.
[19:46] IA is fucked right now
[19:46] please leave a message after the beep
[19:46] :D
[19:46] * chronomex waits for the beep
[19:46] boop
[19:46] * SmileyG hears helicopters
[19:47] But yeah, it's down. One of the core boxes decided to take a dump all over everything, people are working on fixing it now
[19:47] ok, I'm not in a hurry
[19:47] underscor: wat
[19:47] IA went down?
[19:48] it's down right now
[19:48] we broke it ¬_¬
[19:48] lol
[19:49] oh wow
[19:50] Can't edit the list, but Cinch is missing. City of Heroes (two items, I think: boards and www).
[19:52] Qaudio.
[20:04] god I hate efnet
[20:05] anyway
[20:05] is anyone up for testing a useful script?
[20:05] wrote a script that takes a glob pattern, then tries to figure out (from extension) what kind of archive each file is, and prints the decompressed contents to stdout using the appropriate application, without actually unpacking it
[20:05] consider it a 'cat' for archives :)
[20:14] so like zcat?
[20:15] igelritte: you know you can be in multiple channels at once, right?
[20:15] igelritte: yeah, most of us are in both
[20:16] well, actually, I don't know how to do it with pidgin
[20:16] but I think you can
[20:16] just /j #channel1 and /j #channel2
[20:16] they open up as tabs
[20:16] at least in my pidgin
[20:17] yeah, I didn't think about it
[20:17] whateve's. I'm here now
[20:18] k
[20:19] so, tell me more about your structure and how one can plug in.
[20:21] Is it some starry-eyed-open-source-free-for-all? Or is there a process wherein you tell a gatekeeper what you can do, what you're experienced with, and then they tell you where you can start helping?
[20:22] freeforall.
[20:23] I've seen Mr. Scott's presentation at Defcon on how AT is going to save your shit...which sounds good to me...but that doesn't tell me a lot about how the group is organized.
[20:23] some people write code
[20:23] I appear and make comments
[20:23] most people run some sort of downloaders
[20:23] godane is ..... well I don't know :D
[20:24] There are often projects you can help in by running code written by others, basically volunteering your bandwidth to help out.
[20:24] godane is affiliated but mostly works on solo projects
[20:24] Those are usually advertised on the wiki and IRC, plus I think there's a mailing list for it now too.
[20:24] Unfortunately, I'm not really in a good position at the moment to run downloaders or anything else that requires a 24 hour network connection.
[20:24] If you haven't got bandwidth, then you can help with the wiki and possibly coding...
[20:25] doesn't need 24hr, it'll work when you can
[20:25] upto a point
[20:25] joepie91: that already exists as lsar in The Unarchiver, although it's all built-in and not invoking other apps
[20:25] I'm following this silly dream about living in Germany which means that my current address is--shall we say--fluid.
[20:25] oh wait I'm wrong nm
[20:26] keep forgetting unix cat is not the same as apple II cat :)
[20:26] Are most people in North America?
[20:26] a good number but by no means all
[20:26] i'm UK
[20:27] I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
[20:27] igelritte: jason is in the gatekeeper role more or less, or cat herder if you prefer
[20:27] in order probably US, UK, AU, .eu
[20:27] Jason seems to do a lot.
[20:29] but there's a lot of empowerment if you see something to just do it yourself
[20:29] Well, I can definitely help with the wiki
[20:30] when you say, 'coding', what do you mean?
[20:30] Programming stuff that downloads stuff
[20:30] I have a fair amount of experience with BASH scripting
[20:31] what are you guys using to download stuff?
[20:31] perfect
[20:31] primarily wget
[20:31] oh, hold on there, soldier, my BASH scripting is far from perfect
[20:31] DFJustin: The Unarchiver sounds like a comic hero :P
[20:31] it's like a real-life superhero
[20:31] but I have written some stuff using wget to batch-download stuff for myself
[20:32] the main difference is we use a parameter to wget to have it produce .warc files, which are a full record of HTTP headers etc., suitable for going into the wayback machine
[20:32] lectures from the OpenCourseWare project at MIT
[20:32] hmmm
[20:33] Yes, so if you download anything for archiving, use the --warc-file option (available in Wget 1.14).
[20:34] hmmm. It appears that the wget that comes with Ubuntu these days is 1.13
[20:34] at least, so says dpkg
[20:35] You'll need to build it yourself then (or grab a newer package). .warc support wasn't added until 1.14.
[20:35] for our big multi-user projects we supply a ready-made VM with everything all set up and just a go button to push
[20:35] okay
[20:35] um, what are warc files and why use them?
[20:36] warc is a standardized format for web archives; it includes all the HTTP response data from the server (not just the file contents) so that you can "play it back" with a proxy and duplicate the original site exactly
[20:36] Y'all are interested in full HTTP headers, or the wayback machine?
[20:36] interesting
[20:37] very interesting
[20:37] the main impetus is that it's a requirement for wayback to integrate the data (proper timestamps are a necessity, for example)
[20:37] Okay, I can see what you're saying
[20:38] everyone grabbed geocities kind of higgledy-piggledy and it's hard to pin down the dates for anything because of filesystems, time zones, modification time vs download time, etc.
[20:39] so the later projects have been standardized on warc
[20:39] The Geocities project was quite an accomplishment
[20:41] warc is big with the pointy-headed academic world because of formal documentation etc., so that gives us an in with that crowd too
[20:41] unfortunately the end-user tools for it are not great yet
[20:43] I loved Jason's picture of the datacenter where the nine terabytes were housed. It reminded me of this scene from 'Connections'--that interesting spin on discovery and invention that came out in the 70's by James Burke--where he holds up an old tape cartridge and expounds: "this device holds one million characters," in that tone of voice like the audience is supposed to piss themselves in amazement. You then do the math and realize that
[20:43] DFJustin: is there a format specification for warc?
[20:43] one that is publicly accessible
[20:44] ISO 28500
[20:45] CHF 122,00
[20:45] eh.
[20:46] DFJustin: anything or any place that *doesn't* want to see the inside of my wallet?
[20:46] :|
[20:46] obviously, you can google it just as well as I can though
[20:46] yes, and I only get drafts
[20:47] do I seriously have to pirate a document to figure out what warc looks like
[20:47] :|
[20:47] I have to say that you folks seem downright Edwardian in your manners. Most of my experiences in chatrooms with tech-savvy folks have not been so pleasant.
[20:48] :D
[20:48] Most people suck.
[20:48] I think the fact everyone is here because they care about it helps, rather than being here because of "work" or other reasons.
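Building a warc-capable wget on an Ubuntu box that ships 1.13 is straightforward; a sketch, assuming the GNU mirror layout and the GnuTLS and zlib dev packages are installed:

    wget https://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xzf wget-1.14.tar.gz && cd wget-1.14
    ./configure --with-ssl=gnutls   # zlib headers enable .warc.gz output
    make && sudo make install
    wget --version | head -1        # should now report GNU Wget 1.14

Note that this installs under /usr/local by default, which is why dpkg keeps reporting the distribution's 1.13 afterwards.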
[20:49] my suspicion is that the 0.18 draft is the same as the final, because international standards move slowly, but I'll defer that to someone whose head is pointier :)
[20:49] I was working on Linux From Scratch a few years back; their IRC... well, let's just say that you need a thick skin.
[20:49] I believe the bib-something site has a PDF of a draft of the warc spec.
[20:49] The warc people at archive.org assured me that that's what they use.
[20:49] And none of those people were there for work...
[20:49] http://bibnum.bnf.fr/WARC/warc_ISO_DIS_28500.pdf
[20:50] ah yeah, hmm
[20:50] That's it. Just change the version header WARC/0.18 to WARC/1.0, or something.
[20:50] igelritte: I've been on "both" sides of the argument
[20:51] alard: the draft is representative?
[20:51] * joepie91 really hates 'standards' that you can't just view
[20:52] Yes, I believe so. The Heritrix implementation is based on the same draft, so that's something.
[20:52] Tell me about it, joepie91. I worked in telecom for years. Any idea what they want for a membership to the ITU?
[20:52] http://netpreserve.org/publications/WARC_Guidelines_v1.pdf
[20:52] igelritte: not sure I even want to know the number of digits
[20:53] It's pretty gross
[20:53] alard: that 404s
[20:53] anyhow, I'll use the bibnum one then
[20:54] Does it? I just copied the link I put on the wiki months ago. :)
[20:54] http://archiveteam.org/index.php?title=BT_Internet C-, needs work
[20:54] :D
[20:54] http://www.netpreserve.org/resources/warc-implementation-guidelines-v1
[20:54] http://www.netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
[20:55] thankies
[20:55] (It's pretty silly that an "internet preservation consortium" doesn't have stable urls.)
[20:55] one of the nice things about WARC, though, is that it's basically human-readable; you open it up and bam, headers
[20:55] so it's reasonably future-proof
[20:57] lol alard
[21:00] Can't upload images to the wiki?
[21:00] When you watch Jason's presentation at Defcon, you know that other people are involved and that recruits are needed, but the specifics are still a little vague. I guess that I've spent so much time interacting with organizations by being told what to do that the free-for-all comes off as very chaotic. Still not very sure where I can plug in.
[21:00] why didn't I see "upload file"? XD
[21:00] hmm, interesting... http://www.webarchivingbucket.com/
[21:00] igelritte: link to the presentation?
[21:01] sure
[21:02] well, our formal projects now are all "run the warrior VM", where we tell your computer exactly what to do
[21:02] www.btinternet.com/~catechnology
[21:02] www.btinternet.com/~ted.power
[21:02] www.dgsgardening.btinternet.co.uk
[21:02] www.mstracey.btinternet.co.uk
[21:02] cc alard
[21:02] it's just that, on top of that, people have their own archiving side projects that are related to the mission in varying degrees
[21:02] joepie91: http://tracker.archiveteam.org/webshots/rescue-me
[21:03] alard: webshots?
[21:03] shouldn't that be btinternet?
[21:03] Oops, sorry, http://tracker.archiveteam.org/btinternet/rescue-me
[21:03] :P
[21:03] is that expecting urls or usernames
[21:03] usernames
[21:04] 0 items added to the queue
[21:04] Thanks for your help!
[21:04] lol
[21:04] Heh.
[21:05] The tracker really appreciates your contribution, it just wasn't useful. :)
[21:05] haha
[21:06] looks like catarc works well :)
[21:06] http://sebsauvage.net/paste/?9e695a09848493ea#Yy3GjmiyMI4bfhUcKv9vahutcX48KTJBHLivJh8l2BU=
[21:06] nice regex
[21:07] I got that from the presentation. Something about a kid of 15 in Australia being threatened with legal action for downloading poetry.
[21:07] I can't remember
[21:07] hahahahahaha
[21:07] was that bluemax?
[21:08] what happened with that o_O
[21:08] joepie91: we conform to the draft, fyi
[21:09] http://archiveteam.org/index.php?title=BT_Internet <<< wtf is with the "no description" below the image
[21:09] ok, thanks :P
[21:09] we being archive.org
[21:09] SmileyG: lulu poetry's IT department sent a scary letter to him
[21:09] "scary" "letter"
[21:10] o
[21:11] igelritte: does a video of the defcon presentation exist?
[21:11] I can't find it
[21:14] SmileyG: The "No description" comes from the image, I think.
[21:14] except it has a description :/
[21:16] problems with the archive? I'm getting "rsync: failed to connect to fos.textfiles.com: Connection timed out (110)" all the time
[21:16] SmileyG: Oh. Then maybe it's in the template? http://archiveteam.org/index.php?title=Template:Infobox_project&action=edit
[21:18] Dark-Star: it's down atm
[21:19] ah okay. I'll just leave the Warrior running overnight then. I guess it'll automatically resume the upload later
[21:23] alard: ah yeah, hmmm :S
[21:24] weird, because the mobileme one doesn't do it
[21:26] right on... I'm not as stupid as I originally suspected
[21:26] GNU Wget 1.14 built on linux-gnu.
[21:27] I now have the ability to support warc
[21:27] though my dpkg still thinks that I'm working with 1.13
[21:28] It's probably been six months or more since I've compiled and installed anything from scratch. It's funny how quickly you forget that shit.
[21:28] igelritte: I don't want to temper your enthusiasm and sense of achievement, but you might want to check if your new Wget includes gzip and SSL support. It's in wget -V, I think.
[21:30] well, I'm pretty sure that it does, because I kept getting an SSL error and had to dig into why, and then install the libcurl and libgnutls dev packages in order to get wget to compile correctly
[21:30] but I will check
[21:30] Ah good, then it'll probably work.
[21:30] soultcer: Starting TinyBack for Item
[21:31] (Hint: the git clone is very slow if there's no .git in the repository url: https://github.com/soult/tinyback.git )
[21:32] It is? Damn, I always felt so clever because I had to type 4 characters less
[21:32] well, right under the version number, you get the following list: +digest +https +ipv6 +iri +large-file +nls -ntlm +opie +ssl/gnutls
[21:32] http://tracker.tinyarchive.org/v1/ <-- "ranking"
[21:33] soultcer: It's strange, because it does seem to work, but it just takes a long time. I was wondering what my warrior was doing.
[21:33] I'm not sure about the 'wget -V, I' syntax... is that supposed to be 'wget -V -I'?
[21:33] or really a comma
[21:33] Heh. The comma and I are part of the sentence. :)
[21:33] * igelritte laughs at self
[21:34] igelritte: if you're interested in downloading, you can download the ArchiveTeam Warrior virtual machine - it has everything already set up. http://archive.org/details/archiveteam-warrior
[21:35] To check if you have gzip support, use: wget --help | grep warc-compression and see if it returns something. If it does, it works.
[21:35] I'm a little limited in what I can do with downloading at the moment. This network connection is not really my own.
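A rough bash analogue of the catarc idea being tested above, dispatching on extension and streaming decompressed contents to stdout (a sketch, not the actual catarc implementation, which is a Python package):

    arccat() {
        local f
        for f in "$@"; do
            case "$f" in
                *.tar.gz|*.tgz) tar -xzOf "$f" ;;        # -O extracts members to stdout
                *.tar.bz2)      tar -xjOf "$f" ;;
                *.gz)           gunzip -c "$f" ;;
                *.bz2)          bunzip2 -c "$f" ;;
                *.zip)          unzip -p "$f" ;;
                *.7z)           7z e -so "$f" 2>/dev/null ;;
                *)              echo "arccat: don't know how to read $f" >&2 ;;
            esac
        done
    }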
[21:36] igelritte: does a video of the defcon presentation exist? <-- https://www.youtube.com/watch?v=-2ZTmuX3cog
[21:38] alard: I get "no-warc-compression"; I'm guessing that warc uses gzip for compression?
[21:40] Then your Wget is in top condition. The thing with gzip is: you can make .warc and .warc.gz files. It is much better to do the gzip compression in Wget than to do it afterwards. Wget makes a new gzip record for each downloaded file, so it's possible to extract only part of the .warc.gz. If you use the gzip utility to compress your warc afterwards, you can only decompress everything at once.
[21:43] Just performed a quick little test where I ran the following: wget --warc-file test http://en.wikipedia.org/wiki/Jason_Scott_Sadofsky. This seems to have created the 'test' file that I asked for.
[21:43] -rw-rw-r-- 1 23386 Oct 10 23:41 test.warc.gz
[21:44] quick question to alard: how does one write a setup.py where the resulting install package will copy a python file to the bin directory?
[21:44] /usr/bin etc
[21:44] gunzip -c test.warc.gz to look inside
[21:45] Why do you think I would know? I'm a copy-paste setup.py writer. :)
[21:45] scripts, I think: https://github.com/ArchiveTeam/seesaw-kit/blob/master/setup.py#L41-44
[21:46] well, seesaw does it :P
[21:46] and alright, thanks
[21:46] I thought you were the python distribution / pip / pypi expert. :)
[21:47] very interesting. That seems to have worked. I DO have an HTTP document. It doesn't look anything like a wiki, but I'm guessing I know why that is.
[21:48] alard: oh, not at all
[21:49] I just know how to package up a module with an existing setup.py
[21:49] :P
[21:49] and that's it
[21:57] so, when I unpack this archive file (warc), I should expect to find nothing but pure HTTP?
[21:57] You'll find warc records, some of which have an HTTP body.
[21:58] hmmm
[21:58] You get some warc headers identifying the record (type, target-uri, timestamp, etc.), then the http request or response.
[21:58] There are special types of warc records with metadata, such as the wget command line and log.
[21:59] So it's not the most user-friendly format, you need to work to get the data out.
[21:59] The good thing is that everything is in the file, so you *can* get it out.
[22:00] This is all just for my education, so feel free to tell me to fuck off when you lose patience. But where can I find these headers? When I open the file with a text editor, it appears to be just HTML.
[22:01] You'll have to look better then, they're in there.
[22:01] It starts with WARC/1.0 or something, then there's WARC-Target-URI, etc.
[22:04] Hey, so my commentary before.
[22:06] It has scrolled away. :)
[22:06] SketchCow: http://archive.org/details/archiveteam-city-of-heroes-www is not on the list
[22:07] craziness... I just used vi on the test.warc.gz file and the headers you mentioned showed up. Vi also showed me all the compressed content. I didn't know that vi could do that...
[22:07] SketchCow: geocities - there's a dump on the IA but I can't find it anymore (and it was searchable.... we really need to make those links more accessible...)
[22:08] http://archive.org/details/archiveteam-qaudio-rescue
[22:08] http://archive.org/details/archiveteam-cinch
[22:08] wait wait wait wait, what? Jeroenz0r is/was part of urlteam?
[22:09] Only WARC items. So Geocities precedes that.
[22:10] ah, k
[22:12] Perhaps I'm really thick here... and that wouldn't be a surprise... but I'm still not seeing how I can contribute. Is there a list of "shit that needs to get done and we'd be thrilled if you'd take it on" somewhere?
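The record structure described above is easy to see from the shell: because wget writes a separate gzip member per record, plain zcat walks the whole capture (using the test file from the wget run above):

    zcat test.warc.gz | head -40                     # warcinfo record, then request/response pairs
    zcat test.warc.gz | grep -a '^WARC-Target-URI'   # -a because response bodies may be binary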
[22:12] Both added, alard
[22:12] What's your skillset, igelritte?
[22:12] various godane grabs(tm) at https://archive.org/search.php?query=warc%20uploader%3A%22slaxemulator%40gmail.com%22
[22:13] There are some groklaw.net warcs: http://archive.org/details/groklaw.net-pdfs-2004-20120827
[22:13] Well, I've done some BASH scripts. I'm trilingual. I've done lots of networking.
[22:13] And there's a bunch of voip in there too
[22:13] http://archive.org/search.php?query=groklaw%20warc
[22:14] igelritte: is there any chance you can turn the install script for the webshots script into something more sane?
[22:14] because I suck at bash :P
[22:14] I'm not that awesome at it either, but I can look at it.
[22:14] the current script is at http://cryto.net/projects/webshots/webshots_debian.sh
[22:14] http://archive.org/search.php?query=warc%20journalstar (but it's getting more obscure now)
[22:14] thanks :)
[22:16] Hmmm...
[22:16] joepie91: you can set a trap on error to avoid all the conditionals
[22:16] this could use some commenting and perhaps a header
[22:16] and then have it print "Error on line x". Not as nice of a message, though.
[22:17] who wrote this? And why are they doing an apt-get at the beginning?
[22:17] igelritte: I did
[22:17] and the apt-get is to install dependencies
[22:18] http://archive.org/search.php?query=uploader%3A%28slaxemulator%40gmail.com%29%20AND%20warc
[22:18] is there an echo in here
[22:18] I think I see what you're doing here, and I understand why you would do an apt-get update before doing an install
[22:18] DFJustin: Oh, sorry. :)
[22:18] but I don't think I understand enough of the purpose here to understand why you would do that in a script
[22:19] igelritte: it's apt-get update, not upgrade
[22:19] it just updates the package list
[22:19] I'm guessing that my ignorance is to blame
[22:19] right
[22:19] joepie91 / igelritte: here's a nice article on BASH traps, btw. http://phaq.phunsites.net/2010/11/22/trap-errors-exit-codes-and-line-numbers-within-a-bash-script/
[22:19] typo on my part
[22:19] it broke for some people because the package lists weren't up to date, so that's why the update is there :)
[22:21] joepie91: also, why are you using useradd? On Debian you're supposed to use the adduser command, afaik
[22:21] adduser is interactive
[22:21] Doesn't have to be
[22:21] At least, I think you can make it a one-liner
[22:21] iirc I haven't found a way to make it non-interactive
[22:21] :P
[22:22] anyway, any particular reason not to use useradd?
[22:22] Does useradd make the home directory?
[22:22] yes
[22:22] o
[22:22] Welp, adduser just follows a nice configuration file that specifies things like the permissions to set on the home directory, among other things
[22:23] But I guess useradd works OK. I was just curious. :-)
[22:36] SketchCow: there are more qaudio items, http://archive.org/details/archiveteam-qaudio-archive-1 through http://archive.org/details/archiveteam-qaudio-archive-7
[22:39] also fan fiction http://archive.org/search.php?query=%22fan%20fiction%22%20archiveteam
[22:41] right
[22:41] pip install catarc
[22:41] :)
[22:41] cat for archives
[22:48] OK, so I got out of a meeting about incorporating archive team stuff into wayback
[22:48] NATURALLY it's slightly more complicated in some cases.
[22:49] Let me make some changes to the thing.
[22:52] of course it is
[22:52] what kind of changes do they want?
[22:55] Look at the document again. All green ones are cleared for takeoff.
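The trap suggestion in miniature - a hedged sketch of how an install script like the webshots one could report failures without a conditional after every command (package names are illustrative):

    #!/bin/bash
    set -e                          # abort on the first failing command
    trap 'echo "Error on line $LINENO (exit code $?)" >&2' ERR

    apt-get update                  # refresh package lists so installs don't fail on stale indexes
    apt-get -y install wget rsync
    useradd -m webshots             # -m creates the home directory, as discussed above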
[22:55] wow, awesome
[22:57] so it looks like they can just suck in warc-in-nothing, yes?
[22:59] Yes
[22:59] They cannot suck in warc-in-archives
[22:59] So, the next step is to look at the archives ones and see if there aren't too many WARCs in them, say less than 100
[22:59] I mean "just suck in" as in "point the ingestor at"
[22:59] good thing we didn't upload 250tb of that XD
[23:00] lol, yes
[23:01] mobileme: 280T of .tar containing .warc.gz
[23:01] soooo
[23:02] We're aware of it and there'll be a project to deal with that.
[23:02] But I don't want to rush it.
[23:02] So Brewster's letting me make doubled files for weird ones.
[23:06] even if there's a shitload of warcs inside, they can all be cat-ed together into one megawarc, right
[23:07] is there a webshots tracker where I can check the progress? (I'm unable to help, I'm just curious how it's going)
[23:07] http://tracker.archiveteam.org/webshots/
[23:07] thank you :)
[23:08] underscor making his isp cry again
[23:08] Yeah, but the machine is still down
[23:08] so I don't know what's going on
[23:08] DFJustin: Yes, exactly.
[23:13] alard: what about an 'assorted' warrior project
[23:13] with things that are small or heavily rate-limited (like some urlteam targets)
[23:13] that the warrior automatically switches to whenever it has nothing else to do
[23:14] that sounds cool.
[23:14] for example, if the currently selected project is done
[23:14] a "let's not waste any time or bandwidth that we have" mode, so to say :P
[23:14] urlteam is a basically-no-bandwidth project, it might actually make more sense to run it in the background always.
[23:15] maybe have an 'always running' *and* an 'assorted' project
[23:15] yeah
[23:15] separate projects... one always runs, like urlteam
[23:15] and assorted is filled with whatever small project is happening that doesn't warrant its own separate project, really
[23:15] 'assorted' would be filler for "let archiveteam choose"
[23:15] as a fallback when it has nothing better to do
[23:15] well yes, but the thing is
[23:16] say that I've got it configured for btinternet
[23:16] the moment btinternet is done, which will be soon
[23:16] my warrior will be bored out of its skull, no?
[23:17] yes
[23:18] would be good if it switched to 'assorted' then :P
[23:18] 'let archiveteam choose' has a pretty different function
[23:18] that option should always refer to the most urgent project
[23:18] such as, in this case, webshots
[23:18] assorted would have the stuff that isn't really urgent or significant, but has to be done anyway
[23:18] at some point in time
[23:21] ah
[23:36] hi, is fos.textfiles.com down?
[23:36] it is
[23:37] rsync will happily retry until it reappears, right?
[23:37] if I recall correctly, it will retry 50 times
[23:37] before giving up
[23:37] alard can probably confirm that
[23:37] :( 50k link user in queue
[23:37] Fortress of Solitude is back
[23:37] ouch
[23:37] oh, it is?
[23:38] SketchCow: my warrior disagrees
[23:38] rsync: failed to connect to fos.textfiles.com: Connection timed out (110)
[23:38] same here, but I guess it will work soon then :)
[23:39] probably we are just hammering it currently
[23:39] and thx for the info!
[23:39] aaaaand there it went
[23:39] :D
[23:41] Hooray, 517 rsync connections.
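The cat-ing point above works because both gzip and WARC are concatenation-friendly: a series of .warc.gz files joined end to end is itself a valid multi-member .warc.gz. So a megawarc can, in principle, be as simple as:

    cat item-*.warc.gz > megawarc.warc.gz
    zcat megawarc.warc.gz > /dev/null && echo "still a valid gzip stream"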
[23:41] lol
[23:41] working for me now too :)
[23:42] :|
[23:42] uploads just died
[23:42] like, literally flatlined
[23:42] ah, it resumed
[23:42] and flatlined again
[23:42] wat
[23:43] alard: you wanna run through the usernames in these https://en.wikipedia.org/wiki/Wikipedia:Bot_requests#btinternet
[23:43] so, from the following, I can assume that fos = fortress of solitude and that this is some place where folks are trying to rsync there current downloads to. Feel free to direct me to a link that will shut me up.
[23:43] *thier
[23:43] or maybe their
[23:44] igelritte: yes, fos is where the uploads go
[23:44] At some point grammar will come back to me
[23:44] until then
[23:44] indeed
[23:46] phew... seems like some 1 gb stuff is in queue on nooon
[23:50] DFJustin: http://pastie.org/5032511
[23:51] is the clean version
[23:51] of all usernames for both .com and .co.uk
[23:51] sorted, unique
[23:51] also cc alard, idk if that list is already in the tracker
[23:51] k, time to sleep
[23:51] goodnight all :)
[23:51] nice, thanks
[23:59] Well, FOS is getting CRUSHED, we'll see how long this lasts.
[23:59] 848 rsync connections
[23:59] lol