[00:06] PatC: welcome! [00:06] Thank you! [00:07] hi guys, any chance someone could get in contact with this guy: http://libregraphicsworld.org/blog/entry/guitar-samples-in-gig-format-from-flame-studio-collection-shared and help him get those items on archive.org? [00:25] dashcloud: Think uploading them and then just sending him a note is sufficient? [00:26] Since we can download straight from the torrents [00:26] sure [00:26] thanks! [00:27] the torrents have very few seeds [00:31] Yeah, you're right, these are pretty underseeded [00:34] Paradoks, I have to finish archiving some files from good.net, (another week or so) then I can start helping with MobileMe [00:38] PatC: Great. Thankfully, we have a bit of time with MobileMe. It's a ton of data, though. [00:38] ya, isn't it 200+ TB? [00:39] Yeah. Though Archive Team seems to deal in fractions of petabytes at times. Though less so with the current hard drive shortage. [00:40] Definately [00:40] I was thinking of picking up a few 2tb drives, but then that whole thing happened, and I cant get one for less then $150 [00:43] As someone mentioned previously, hopefully this isn't like the RAM shortages of yesteryear. I hope Thailand recovers quickly. [00:43] ya.. [00:51] is this a waste of time wget -mcpke robots=off -U="googlebot" www.fanfiction.net [00:52] im trying to grab all of it, and id like to know if this is the right way to go about it [00:53] well, you should get the googlebot UA right. also, warc support would be nice, [00:53] so how do i do wars, bc i have it and just cant figure it out, and whats the google ua [00:53] (or kK) [00:54] well, googlebot 2.1 is: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) [00:55] "ARCHIVEBOT FUCK YOU ALL" always works great [00:55] and if you have wget-warc (which you do if you helped with splinder), check the --help, as it has the warc options [00:55] and the old standby "EAT DELICIOUS POOP" [00:55] so *thats* where the warc manual is [00:56] also is the syntax correct for setting the useragent [00:57] no. it is -U "your preferred UA string here" [00:57] no = [00:57] thanks [00:58] you use the = if you use the long format of --user-agent="your ua string here" [00:58] ok thanks that was really confising in the wget man pages [00:58] now whats the wget warc command ia would like [00:59] May I ask... what is warc? [00:59] it's an archive format that archive.org uses, contains a full description of a http connection [00:59] ah [00:59] request/response body, all headers, etc [01:09] should i include cdx files too [01:20] guys, I told the splinder scrips to stop, but they haven't stopped yet- is there a quicker way to force them to quit? [01:21] control-c? [01:21] how much do you care? [01:22] I don't want to lose anything, but I'm getting complaints about bandwidth from other people on my connection, so I'd like to get them finished up quickly [01:25] how many are you running? :) [01:26] not that many I thought (150), but enough to seriously impact the wireless network in the house [01:27] ah, wifi. wifi doesn't like archivism. [01:27] so should i include cdx files as well [01:29] is cdx also an archive.org format? [01:31] I have never heard of cdx before. [01:32] neither have i, its a warc option., so thats a no, then? [01:39] hows this for the command wget-warc -mcpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net [01:42] thw upshot of this method is that the stories themselves can be had from a linklist made up of the fiile paths of nearly every file in the folder, because by some miracle, thats how their being saved [01:59] bsmith093: yes, do the cdx file [02:00] bsmith093: then, the next time you run this command wget will skip saving things to the warc if they already exist (and have the same content) [02:23] I'll admit I'm only passingly familiar with Google Knol, Wave, and Gears. Are they the sorts of things we can usefully archive? [02:23] gears, no user content. [02:23] knol, dunno. wave, certainly. [02:24] I'm not sure what the utility of google wave data will be sans google wave [02:24] Doesn't wave have privacy settings? [02:25] that too. [02:25] Are things open by default? because chances are people will leave default security settings [02:28] 17:15 -!- Irssi: Starting query in EFNet with anonymous [02:28] 17:15 May I have an rsync slot? [02:28] 18:02 May I have an rsync slot? [02:28] 18:48 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Quit: Page closed] [02:28] 18:48 May I have an rsync slot? [02:28] 18:52 May I have an rsync slot> [02:28] 19:18 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Ping timeout: 260 seconds] [02:28] That's just not a good way to do it. [02:28] :/ [02:28] It's called, Jason was out shopping [02:29] Oh, your Jason S. ? [02:29] Yep, my Jason S. [02:30] jason with the cat. [02:31] SketchCow, I was 'Pat' from the Google hangout a few nights ago [02:31] [20:33:09] Wierd [02:31] oops [02:31] Hey [02:31] (putty right click paste sorry) [02:31] SketchCow, I got an external hard drive dock, as you suggested. Very nice to have [02:32] Excellent [02:34] pity about Wave. Bet there's some very interesting and historic content in there [02:34] Agree [02:34] Also putty: My weblog has gotten the white screen of death [02:34] No idea why [02:36] Have you guys seen these little storage boxes? http://usb.brando.com/hdd-paper-storage-box-with-cover-5-bay-_p00962c044d15.html [02:36] They seem to be a good, cheap way of storing hard drives [02:37] Knol sounds like it has content, dunno if it's publically available [02:56] SketchCow, what is your blog url? [02:59] http://wayback.archive.org/web/*/http://ascii.textfiles.com [03:06] Thank you closure [03:08] that is one way to get there [03:11] this is the command im now using ben@ben-laptop:~$ wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net [03:12] and this is the current output Opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Error opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Could not open WARC file. [03:12] bsmith093, that is the filepath? [03:13] yes i deleted the stuff wget had akready downloaded for the sake of doing it right, and recreated the www.fanfiction.net folder for the warc file [03:13] oh wait never mind hold on [03:14] yeah it cant find the warc file because its not there because its supposed to creat it? [03:16] if you 'ls' in the www.fanfiction.com/warc/ folder is there the warc.warc.gz file? [03:18] no its empty [03:18] i havent stared the job yet, and this only happened after i added the --warc-cdx option [03:19] ah [03:19] w/o it its fine [03:19] do i do the cdx later after the job is done as an update to it [03:19] I'm not sure what --warc-cdx does so I can't help you with that, sorry [03:21] db48x2: do i run without the cdx option for th e first donload and then everytime after with the cdx to update? [03:25] im an idiot fanfiction.net not .com had the wrong filepath, all fixed now, runs great with cdx option set [03:25] ah, what would do it [03:36] Wait, i'm sorry, I was the person who mentioned .com [03:49] Works [03:51] nice! [03:51] What was the problem? [03:57] bsmith096: use the cdx option every time [04:01] I overlaid a new install of wordpress [04:01] THANKS WORDPRESS [04:01] THE RESET YOUR ROUTER OF REPAIRS [04:08] SketchCow, congrats on your kickstarter! I can't wait to see the end result :) [04:14] SketchCow: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet [04:21] alard can tell you [04:21] I don't know actually. [04:27] alard: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet [04:28] documentary question, how did you decide to focus on the 6502 chip? [04:30] bsmith096: in wget 1.13.4-2574, the WARC is built as each file is downloaded [04:30] if you have a different version, the behavior may be different [04:31] bsmith096: imo, 6502 is the most widely used 8-bit processor for hobbyists during a certain time. [04:31] it also retains some lasting appeal ... the z80 is still used but is still used industrially and in gameboys, etc, so it's different. [04:33] yipdw^: but the file doesnt exist yet, so is it builing it in ram or something and then wirting at the end, because the path where i told wget warc to put the warc file is currently empty [04:34] bsmith096: dunno. maybe building an index file changes the behavior [04:43] no [04:43] i think it may have failed to open the warc file because the directory didn't exist [04:48] oh [04:48] or that [04:51] also, may I suggest a better base filename than "warc" [04:52] (since it will have .warc or .warc.gz added to the end already) [04:53] also, I have archived Kohl's terrible Black Friday ad, in the event that they develop a sense of shame [04:53] may I suggest something like "www.fanfiction.com_20111122" or the like [04:53] oh? [04:53] I've not looked at any BF ads. [04:53] http://www.youtube.com/watch?v=vGiQzPi0f_E [04:53] a friend linked me to it [04:53] I feel sad [04:54] her, SHIT [04:54] * PatC shreds his screen [04:54] er... [04:55] after the related vids loaded, along with the description, the pieces fell together... [04:55] and THEN the video started playing [04:57] I didn't get the comments or the video info page, though [05:01] not completely positive, but I think Rebecca is the woman in the red. the one getting into the store after the woman singing, and the one she steals the item from and then flicks off [05:01] baha. did you catch the very end of the video? [05:01] Coderjoe: apparently the warc file has to be in the folder ithe wget is saving to not under it, also took your advice and renamed it, and now restarted the job, and the war and cdx now exist so whoo! [05:02] it can be in a different directory, but the directory has to exist at the start of the job [05:02] well its working now so yay [05:04] it's running beautifully [05:37] okay guys. [05:38] http://colour-recovery.wikispaces.com/Full+gamut+colour+recovery these guys are working on recovering color tv [05:38] starting with scans of film that was exposed in a machine that records tv onto film [05:38] they're re-rectangularizing the frames and extracting color from the dot crawl you get on b/w tv [05:38] it's fucking impressive. [05:43] I downloaded all the the ROFLcon summit videos. [05:43] Including myself. [05:43] http://www.archive.org/details/roflconsummit-cpw [05:52] Brewster and you?! [05:55] chronomex: someone was telling me about that, restoring color to old kinescopes of doctor who [05:55] yeah [05:55] it's cool shit. [05:56] modern signal processing is kind of magic [06:02] i stil mildly dislike the old heads at the beeb. the ones that decided to destroy their archives. [06:02] they had reasons [06:11] Oh, don't justify them. [06:27] I didn't say they were good reasons [06:27] but they didn't just say HURRDURR BALEET [06:37] Someone did [06:37] I believe it was TALLY BALEET [06:37] lol [06:45] this is horrible. I'm moving into a smaller space and have to winnow :( [06:45] Noooooo [06:46] is why I have some understanding for what happened at the BBC [06:46] mostly I have a bunch of paper detritus [06:55] SketchCow: can I get an rsync slot? [07:28] what does it mean if I have over 110 instances of the splinder downloader running but only 70 instances of wget? [07:29] there can't be 40 in the process of parsing the profile pages because there's no such python process currently [07:45] Nemo_bis: how high is your load avg? what does your memory usage look like? there can be a number trying to parse if there is heavy disk IO, or you're short on ram and not able to cache stuff for log, or swapping, etc [07:50] this is funny; we're down to the bigger profiles (not counting the ones alard has not put back into the queue) [07:54] hah [07:54] it:MagicaHermione [07:56] Coderjoe, I'm using all memory and disk load is very high, but there's nothing to parse, it's been downloading the same big users for days [07:58] there seem to be a small amount of extreme splinder users [07:58] weird [07:59] just had the same downloader-user pair show up twice on the dashboard [07:59] yeah I saw that too! [07:59] Nemo it:Luley 0MB [07:59] Nemo it:Luley 0MB [07:59] hm [08:00] I was just looking at that user [08:01] uh, there's some user being downloaded at 10 Mb/s, what a joy [08:04] machine is too slow to open browser now, sorry if I flood the channel [08:04] - Downloading profile HTML pages... done. [08:04] - Parsing profile HTML to extract media urls... done. [08:04] Deleting incomplete result for it:Luley [08:04] Downloading it:Luley profile [08:04] it:Luley contains 502 or 504 errors, needs to be fixed. [08:04] - Downloading 4 media files... done, with HTTP errors. [08:04] - Checking for important 502, 504 errors... none found. [08:04] - Result: 134K [08:04] rm: impossibile rimuovere "data/it/L/Lu/Lul/Luley/.incomplete": File o directory non esistente [08:04] Telling tracker that 'it:Luley' is done. [08:04] blogs.txt media-urls.txt splinder.com-Luley-html.warc.gz splinder.com-Luley-media.warc.gz wget-phase-1.log wget-phase-2.log [08:04] ls data/it/L/Lu/Lul/Luley [08:05] you happened to have two threads doing the same user? friggin awesome :-\ [08:05] no, I don't think so [08:05] it would explain the double-complete and that rm error [08:05] I think that for some reason it marked it complete two times, the second failing to remove the .incomplete (because already deleted) but telling to tracke two times [08:06] (one thread deleted the .incomplete file before the other) [08:06] perhaps due to disk load it queued the "remove and tell to tracker" two times? (I don't know how it works) [08:07] no, it wouldn't have done that [08:07] ah, you're right [08:07] unless there were two dld-single.sh instances working on the same user at the same time [08:07] hm, this shouldn't happen [08:09] no idea how that would have happened unless you stopped dld-streamer.sh with ^c and didn't actually stop all the children, and then started downloaders through another means [08:09] no, I'm running two instances of fix-dld [08:09] oh [08:09] don't do that [08:09] because there are too many users to fix and it's very slow, and they shouldn't conflict [08:10] they will conflict [08:10] but the first which finds a user adds .cinomplete and the second ignores it [08:10] it's always worked :-/ [08:11] only because you were lucky before on the race conditions [08:12] but how can they conflict, doesn't the fixer add the .incomplete mark as soon as it starts working on a user? [08:12] yes, but if the one got past that check just before the other makes the file, they will BOTH think they can work on that user [08:12] like I said, RACE CONDITIONS [08:13] checking for the file and creating the file are not atomic operations [08:13] yes, I understand, but... I thought that writing an empty file wouldn't be a problem. It happened 4 times though, stopping now... [08:14] Before us.splinder.com went down I was doing one for it users and one for us user [08:15] fix-dld doesn't have a means of specifying which country to work on [08:16] I just modify the script [08:17] if you look at the script, there is a rather large time gap between checking for .incomplete and creating it [08:17] (the grep) [08:18] ah [08:19] indeed [08:20] although in this was it checks for way more .incomplete than needed and it saves just a few greppings [08:21] eh? I can't quite parse that. (I should also probably go to bed soon as well) [08:24] if you wanted to make it more parallel, the way to go about it would be to write out the list of need-redo users to a file and then (carefully) modify dld-streamer.sh to read from that file instead of asking the tracker. (but do not run that dld-streamer until your fix script is done identifying all users to redo) [08:25] (I think this approach is a bit easier and less error prone than trying to bring in all the job control logic from dld-streamer into fix-dld) [08:29] yes, but I don't have *that* many users to fix [08:30] right, but you wanted to run the downloads in parallel. fix-dld was not written to support running multiple instances on the same data directory at the same time [08:30] what I meant is that you usually put the less expensive and more effective condition first, so it makes sense that fix-dld checks for .incomplete as first thing, but it's not actually needed if the user doesn't need to be fixed [08:31] yep, I just gave up, I'll be patient :-p [08:31] the check for .incomplete was to allow fix to be run while downloaders are also running [08:31] (as stated in the comment above the check) [08:31] so perhaps it could be more efficient to check for the .inomplete after grep [08:31] yes, I meant ^ [08:32] is there a way to create a new git repo and upload files using the website at github? [08:32] wait nevermind [08:32] i forgot i published it already ( https://github.com/SpiritQuaddicted/sourceforge-file-download ) :D [08:35] hm, moddb might be a worthy target [08:44] http://ascii.textfiles.com/archives/3395 [08:47] SketchCow: you left his mail address in once, dunno if that was intentional [08:49] Un. [08:49] Fixing. [08:50] Fixed. [08:50] * Schbirid posts unredacted version to archive.org [08:50] just kidding [08:52] hm. google suggests that someone with that email address is into penny stocks [08:52] interesting. [08:53] this makes me think of some emails in my inbox... :"( [08:53] oh well, gotta go [08:54] Once I sent an email in 2004 and got a response in 2009. [08:54] These things happen. [08:54] actually 2008 [08:54] those things rock [08:54] so I waited to 2011 to reply back. [08:55] sounds like my astonishing talent of conversation with girls i fancy [08:55] ? [08:56] gets hit on in 2004, realises in 2008 [08:56] :D [08:56] smrt. [08:56] Sleep with in 1996, realize she was using you in 2007 [09:01] ) [09:01] :) [09:02] This was just the therapy to be able to reach some sort of thing with this guy. [09:02] Send him his drive back. [09:02] be done with it. [09:02] Now, he percieves a crime worthy of monetary damages. [09:02] That's interesting. [09:02] I'd like to see that court case. [09:11] opera makes it much too easy to collect too many open tabs over the week [11:41] google knol is closing [11:43] indeed [11:43] can we save it? [11:43] DUDE. [11:45] hmm [11:45] firefox just crashed [11:45] it doesn't usually do that [11:45] http://googleblog.blogspot.com/2011/11/more-spring-cleaning-out-of-season.html [11:48] Google is now on the Archive Team black list. [11:51] well, we have five months [11:52] google joins yahoo in the Bad Decision Club [12:15] hi! does anyone know how to get curl to play nicely with Yahoo!'s login screen? [12:16] hm. yahoo's login screen barely works for me in a real web browser :P [12:16] I've got as far as trying the following: [12:16] curl -c cookie.txt -d "login=zoeblade&passwd=foo&submit=Sign In" https://login.yahoo.com/config/login [12:16] curl -b cookie.txt -A "Mozilla/4.0" -O http://launch.groups.yahoo.com/group/foo/message/[1-23217]?source=1 [12:16] I'd suggest looking at the program 'fetchyahoo' which downloads mail from a yahoo account [12:16] that has some rather robust login code [12:17] but although it does save a cookies file, it keeps on trying to redirect me to the login screen, so presumably I haven't logged in correctly [12:17] or it did when I used it few years ago :] [12:17] good idea, thanks [12:17] hm. [12:17] good luck! it's 4am and I should have been in bed hours ago :| [12:23] sweet dreams ^.^ [12:32] Hey people [12:37] hi [12:37] hello Hydriz [12:39] How is the day? [12:39] Just seen the Knol news [12:41] the day is early [12:42] lol [12:43] just feel like jumping in and start archiving Knol [12:43] just love the archiving feeling [12:46] great [12:46] the place to start is by exploring the site and seeing how it's organized [12:46] can we download things by user, or by some other enumerable index? [12:50] looks like it is sorted by user? [12:50] or maybe... [12:51] Lets take an example: http://knol.google.com/k/scott-jenson/scott-jenson/6b7e08nms1ct/0#knols shows knols the user created [12:52] excellent [12:59] are you writing the script now? [13:02] no; I'm far too tired to be of any use there [13:03] LOL [13:44] Funny how the Knols he wrote is that they've been viewed strangely [13:44] 27k times, 3k, 5k [13:44] nah, google estimates on that sort of thing all the time [13:45] number of search results, word counts in books, etc [13:45] those bastards [13:45] all estimates posing as real numbers [16:56] HEY IT IS THE FAMOUS ZOE BLADE [18:01] SketchCow: hey, I think you left out your own talk on http://www.archive.org/details/roflcon-summit [18:03] yay, a download! [18:06] It's there, but you have to click on "all" [18:10] doh! one day this will have a better interface, I'm sure [18:14] on the plus side, that's another successfull transaction, and I didn't even blackmail you [18:40] hah [18:48] wow [18:49] I just finished a 243MB user [18:52] I have users who have been going for 3 days now. [18:54] hmmm [18:55] you know what might be handy? a ramfs/tmpfs-disk hybrid filesystem [18:56] you set a maximum ram usage and it keeps recently used/accessed stuff only in ram. when that ram space is exceeded, stuff that hasn't been touched in awhile gets flushed out to a backing directory on a hard drive [18:58] this way, you get the benefits of tmpfs/ramfs for the files that get downloaded and grepped, but don't potentially fill it up with additional stuff that is downloaded and then not touched [19:01] and is better than just the disk cache over a normal filesystem because stuff doesn't even hit rotational media until it gets evicted from ram [19:09] i just got back, google knol is going away? [19:12] yea, I had 26 users still going when I left this morning [19:12] bsmith094: yep [19:13] so is splinder done yet? fully? [19:13] nope [19:13] us.splinder.com is still down [19:13] so those users are out of circulation [19:13] not done, beause the tracker for the streamer script is down [19:13] ?? [19:14] and we all need to check our data for incomplete or broken users and finish them [19:14] bsmith094: the tracker isn't handing out any us users, so the script will see it as done [19:16] ok i just spent 14 hrs doing a wget of fanfiction.net, and why fo i have blah and blah.orig [19:16] cool. how big is the result? [19:17] Hi dear friends! Can anyone of you with Windows operating system and Internet Explorer web browser help me just for a few minutes? I'm dealing with a bug in a file of Project Gutenberg and I don't use Windows/IE, I'm on a GNU/Linux box and online IE rendering tools is not helpful for my case. I want to learn what you see on page www.gutenberg.org/dirs/etext06/8loc110h.htm under the expression "digestibility will be acquired." (without quotes). An [19:17] du -ach isnt done yet [19:17] image? Or some text? What? Thanks in advance! [19:17] bsmith094: :) [19:17] so whats the .orig files and can i get rid of them? [19:18] marenostr: this isn't really the right channel for that kind of question [19:18] bsmith094: dunno. check your wget log and see when it created them [19:18] maybe the server just had files with that name on it [19:18] where would the log be [19:18] where did you save it? [19:18] db48x, OK. Sorry. Googling gave me that impression. Sorry. [19:19] wget wwwfanfiction.net [19:19] hrm [19:19] marenostr: what specifically did you see that gave you this impression? [19:19] bsmith094: did you specify a -o or -a option? [19:23] db48x, On page http://archiveteam.org/index.php?title=Project_Gutenberg , right side, at the bottom of the box says. IRC channel: #archiveteam and this is meant for Project Gutenberg. It gave me that impression. [19:26] db48x2: i used this wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/fanfiction_20111122 www.fanfiction.net [19:26] bsmith094: so wget sent it's log to stdout [19:27] bsmith094: did you redirect stdout to a file? [19:27] for example: wget ... www.fanfiction.net > archive.log [19:27] errr... no, but in hindsight that would have been a good ides [19:27] indeed :) [19:27] Evening folks [19:28] marenostr: ah. Archive Team is all about archiving webpages, and one of the webpages we have archived/are archiving is Project Gutenberg :) [19:28] PatC: howdy [19:28] to give an idea of how large this site is for being mostly text, the warc file is 850mb [19:31] doesnt PG have a problem with robots bulk downloading [19:33] my disk currently hates me, so im gonna let du -ach run afor a while [19:33] Pulling 2MB/s off archive.org I didn't know their internet connection was this good / not limited [19:34] Good/unlimited internet connections are 2 GB/s nowadays [19:35] wow! [19:39] bsmith094: if you used -k and -K the -K means back up the original files before changing the links. (which isn't really needed if you were doing a warc as well, since the warc would have the original data the server sent) [19:40] also, if you let wget run to completion, a copy of the log should be in the warc file [19:40] ah, right, -K [19:44]