#archiveteam 2011-11-23,Wed


Time Nickname Message
00:06 🔗 db48x2 PatC: welcome!
00:06 🔗 PatC Thank you!
00:07 🔗 dashcloud hi guys, any chance someone could get in contact with this guy: http://libregraphicsworld.org/blog/entry/guitar-samples-in-gig-format-from-flame-studio-collection-shared and help him get those items on archive.org?
00:25 🔗 underscor dashcloud: Think uploading them and then just sending him a note is sufficient?
00:26 🔗 underscor Since we can download straight from the torrents
00:26 🔗 dashcloud sure
00:26 🔗 dashcloud thanks!
00:27 🔗 bsmith093 the torrents have very few seeds
00:31 🔗 underscor Yeah, you're right, these are pretty underseeded
00:34 🔗 PatC Paradoks, I have to finish archiving some files from good.net, (another week or so) then I can start helping with MobileMe
00:38 🔗 Paradoks PatC: Great. Thankfully, we have a bit of time with MobileMe. It's a ton of data, though.
00:38 🔗 PatC ya, isn't it 200+ TB?
00:39 🔗 Paradoks Yeah. Though Archive Team seems to deal in fractions of petabytes at times. Though less so with the current hard drive shortage.
00:40 🔗 PatC Definitely
00:40 🔗 PatC I was thinking of picking up a few 2tb drives, but then that whole thing happened, and I cant get one for less than $150
00:43 🔗 Paradoks As someone mentioned previously, hopefully this isn't like the RAM shortages of yesteryear. I hope Thailand recovers quickly.
00:43 🔗 PatC ya..
00:51 🔗 bsmith093 is this a waste of time wget -mcpke robots=off -U="googlebot" www.fanfiction.net
00:52 🔗 bsmith093 im trying to grab all of it, and id like to know if this is the right way to go about it
00:53 🔗 Coderjoe well, you should get the googlebot UA right. also, warc support would be nice,
00:53 🔗 bsmith093 so how do i do warcs, bc i have it and just cant figure it out, and whats the google ua
00:53 🔗 Coderjoe (or kK)
00:54 🔗 Coderjoe well, googlebot 2.1 is: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
00:55 🔗 chronomex "ARCHIVEBOT FUCK YOU ALL" always works great
00:55 🔗 Coderjoe and if you have wget-warc (which you do if you helped with splinder), check the --help, as it has the warc options
00:55 🔗 chronomex and the old standby "EAT DELICIOUS POOP"
00:55 🔗 bsmith093 so *thats* where the warc manual is
00:56 🔗 bsmith093 also is the syntax correct for setting the useragent
00:57 🔗 Coderjoe no. it is -U "your preferred UA string here"
00:57 🔗 Coderjoe no =
00:57 🔗 bsmith093 thanks
00:58 🔗 Coderjoe you use the = if you use the long format of --user-agent="your ua string here"
00:58 🔗 bsmith093 ok thanks that was really confusing in the wget man pages
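Both user-agent forms side by side, using the Googlebot string Coderjoe quotes above (standard wget options; only the URL is a placeholder):
    # short option: a space, no equals sign
    wget -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://example.com/
    # long option: an equals sign
    wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" http://example.com/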
00:58 🔗 bsmith093 now whats the wget warc command ia would like
00:59 🔗 PatC May I ask... what is warc?
00:59 🔗 chronomex it's an archive format that archive.org uses, contains a full description of a http connection
00:59 🔗 PatC ah
00:59 🔗 chronomex request/response body, all headers, etc
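For reference, a WARC file is a sequence of records like the sketch below, one per request/response plus some metadata records; the URI, date, and lengths here are illustrative values, not taken from this conversation:
    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.example.com/page.html
    WARC-Date: 2011-11-23T00:59:00Z
    WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
    Content-Type: application/http; msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Content-Type: text/html
    ... headers and body exactly as received from the server ...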
01:09 🔗 bsmith093 should i include cdx files too
01:20 🔗 dashcloud guys, I told the splinder scrips to stop, but they haven't stopped yet- is there a quicker way to force them to quit?
01:21 🔗 chronomex control-c?
01:21 🔗 chronomex how much do you care?
01:22 🔗 dashcloud I don't want to lose anything, but I'm getting complaints about bandwidth from other people on my connection, so I'd like to get them finished up quickly
01:25 🔗 chronomex how many are you running? :)
01:26 🔗 dashcloud not that many I thought (150), but enough to seriously impact the wireless network in the house
01:27 🔗 chronomex ah, wifi. wifi doesn't like archivism.
01:27 🔗 bsmith093 so should i include cdx files as well
01:29 🔗 PatC is cdx also an archive.org format?
01:31 🔗 chronomex I have never heard of cdx before.
01:32 🔗 bsmith093 neither have i, its a warc option., so thats a no, then?
01:39 🔗 bsmith093 hows this for the command wget-warc -mcpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net
01:42 🔗 bsmith093 the upshot of this method is that the stories themselves can be had from a linklist made up of the file paths of nearly every file in the folder, because by some miracle, thats how they're being saved
01:59 🔗 db48x2 bsmith093: yes, do the cdx file
02:00 🔗 db48x2 bsmith093: then, the next time you run this command wget will skip saving things to the warc if they already exist (and have the same content)
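In current GNU wget (which later absorbed the wget-warc work) the index is written with --warc-cdx and reused for deduplication on a later run with --warc-dedup=FILE; whether the 2011 fork dedupes from --warc-cdx alone, as described above, may differ. A sketch of two runs, with illustrative filenames:
    # first run: mirror the site, writing a WARC plus a CDX index next to it
    wget -mcpke robots=off --warc-file=fanfiction_20111122 --warc-cdx http://www.fanfiction.net/
    # later run: point --warc-dedup at the existing CDX so unchanged responses are
    # stored as small revisit records instead of full copies
    wget -mcpke robots=off --warc-file=fanfiction_20111123 --warc-cdx --warc-dedup=fanfiction_20111122.cdx http://www.fanfiction.net/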
02:23 🔗 Paradoks I'll admit I'm only passingly familiar with Google Knol, Wave, and Gears. Are they the sorts of things we can usefully archive?
02:23 🔗 chronomex gears, no user content.
02:23 🔗 chronomex knol, dunno. wave, certainly.
02:24 🔗 chronomex I'm not sure what the utility of google wave data will be sans google wave
02:24 🔗 PatC Doesn't wave have privacy settings?
02:25 🔗 chronomex that too.
02:25 🔗 PatC Are things open by default? because chances are people will leave default security settings
02:28 🔗 SketchCow 17:15 -!- Irssi: Starting query in EFNet with anonymous
02:28 🔗 SketchCow 17:15 <anonymous> May I have an rsync slot?
02:28 🔗 SketchCow 18:02 <anonymous> May I have an rsync slot?
02:28 🔗 SketchCow 18:48 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Quit: Page closed]
02:28 🔗 SketchCow 18:48 <anonymous> May I have an rsync slot?
02:28 🔗 SketchCow 18:52 <anonymous> May I have an rsync slot>
02:28 🔗 SketchCow 19:18 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Ping timeout: 260 seconds]
02:28 🔗 SketchCow That's just not a good way to do it.
02:28 🔗 PatC :/
02:28 🔗 SketchCow It's called, Jason was out shopping
02:29 🔗 PatC Oh, your Jason S. ?
02:29 🔗 SketchCow Yep, my Jason S.
02:30 🔗 chronomex jason with the cat.
02:31 🔗 PatC SketchCow, I was 'Pat' from the Google hangout a few nights ago
02:31 🔗 PatC [20:33:09] <gmnevo> Wierd
02:31 🔗 PatC oops
02:31 🔗 SketchCow Hey
02:31 🔗 PatC (putty right click paste sorry)
02:31 🔗 PatC SketchCow, I got an external hard drive dock, as you suggested. Very nice to have
02:32 🔗 SketchCow Excellent
02:34 🔗 closure pity about Wave. Bet there's some very interesting and historic content in there
02:34 🔗 SketchCow Agree
02:34 🔗 SketchCow Also putty: My weblog has gotten the white screen of death
02:34 🔗 SketchCow No idea why
02:36 🔗 PatC Have you guys seen these little storage boxes? http://usb.brando.com/hdd-paper-storage-box-with-cover-5-bay-_p00962c044d15.html
02:36 🔗 PatC They seem to be a good, cheap way of storing hard drives
02:37 🔗 closure Knol sounds like it has content, dunno if it's publically available
02:56 🔗 PatC SketchCow, what is your blog url?
02:59 🔗 closure http://wayback.archive.org/web/*/http://ascii.textfiles.com
03:06 🔗 PatC Thank you closure
03:08 🔗 chronomex that is one way to get there
03:11 🔗 bsmith093 this is the command im now using ben@ben-laptop:~$ wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net
03:12 🔗 bsmith093 and this is the current output Opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Error opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Could not open WARC file.
03:12 🔗 PatC bsmith093, that is the filepath?
03:13 🔗 bsmith093 yes i deleted the stuff wget had already downloaded for the sake of doing it right, and recreated the www.fanfiction.net folder for the warc file
03:13 🔗 bsmith093 oh wait never mind hold on
03:14 🔗 bsmith093 yeah it cant find the warc file because its not there because its supposed to create it?
03:16 🔗 PatC if you 'ls' in the www.fanfiction.com/warc/ folder is there the warc.warc.gz file?
03:18 🔗 bsmith093 no its empty
03:18 🔗 bsmith093 i havent started the job yet, and this only happened after i added the --warc-cdx option
03:19 🔗 PatC ah
03:19 🔗 bsmith093 w/o it its fine
03:19 🔗 bsmith093 do i do the cdx later after the job is done as an update to it
03:19 🔗 PatC I'm not sure what --warc-cdx does so I can't help you with that, sorry
03:21 🔗 bsmith093 db48x2: do i run without the cdx option for the first download and then every time after with the cdx to update?
03:25 🔗 bsmith093 im an idiot, fanfiction.net not .com, had the wrong filepath, all fixed now, runs great with cdx option set
03:25 🔗 PatC ah, that would do it
03:36 🔗 PatC Wait, i'm sorry, I was the person who mentioned .com
03:49 🔗 SketchCow Works
03:51 🔗 PatC nice!
03:51 🔗 PatC What was the problem?
03:57 🔗 db48x2 bsmith096: use the cdx option every time
04:01 🔗 SketchCow I overlaid a new install of wordpress
04:01 🔗 SketchCow THANKS WORDPRESS
04:01 🔗 SketchCow THE RESET YOUR ROUTER OF REPAIRS
04:08 🔗 PatC SketchCow, congrats on your kickstarter! I can't wait to see the end result :)
04:14 🔗 bsmith096 SketchCow: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet
04:21 🔗 SketchCow alard can tell you
04:21 🔗 SketchCow I don't know actually.
04:27 🔗 bsmith096 alard: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet
04:28 🔗 bsmith096 documentary question, how did you decide to focus on the 6502 chip?
04:30 🔗 yipdw^ bsmith096: in wget 1.13.4-2574, the WARC is built as each file is downloaded
04:30 🔗 yipdw^ if you have a different version, the behavior may be different
04:31 🔗 chronomex bsmith096: imo, 6502 is the most widely used 8-bit processor for hobbyists during a certain time.
04:31 🔗 chronomex it also retains some lasting appeal ... the z80 is still used, but industrially and in gameboys, etc, so it's different.
04:33 🔗 bsmith096 yipdw^: but the file doesnt exist yet, so is it building it in ram or something and then writing at the end, because the path where i told wget warc to put the warc file is currently empty
04:34 🔗 yipdw^ bsmith096: dunno. maybe building an index file changes the behavior
04:43 🔗 Coderjoe no
04:43 🔗 Coderjoe i think it may have failed to open the warc file because the directory didn't exist
04:48 🔗 yipdw^ oh
04:48 🔗 yipdw^ or that
04:51 🔗 Coderjoe also, may I suggest a better base filename than "warc"
04:52 🔗 Coderjoe (since it will have .warc or .warc.gz added to the end already)
04:53 🔗 yipdw^ also, I have archived Kohl's terrible Black Friday ad, in the event that they develop a sense of shame
04:53 🔗 Coderjoe may I suggest something like "www.fanfiction.com_20111122" or the like
04:53 🔗 Coderjoe oh?
04:53 🔗 Coderjoe I've not looked at any BF ads.
04:53 🔗 yipdw^ http://www.youtube.com/watch?v=vGiQzPi0f_E
04:53 🔗 yipdw^ a friend linked me to it
04:53 🔗 yipdw^ I feel sad
04:54 🔗 Coderjoe her, SHIT
04:54 🔗 * PatC shreds his screen
04:54 🔗 Coderjoe er...
04:55 🔗 Coderjoe after the related vids loaded, along with the description, the pieces fell together...
04:55 🔗 Coderjoe and THEN the video started playing
04:57 🔗 yipdw^ I didn't get the comments or the video info page, though
05:01 🔗 Coderjoe not completely positive, but I think Rebecca is the woman in the red. the one getting into the store after the woman singing, and the one she steals the item from and then flicks off
05:01 🔗 Coderjoe baha. did you catch the very end of the video?
05:01 🔗 bsmith096 Coderjoe: apparently the warc file has to be in the folder the wget is saving to, not under it, also took your advice and renamed it, and now restarted the job, and the warc and cdx now exist so whoo!
05:02 🔗 Coderjoe it can be in a different directory, but the directory has to exist at the start of the job
05:02 🔗 bsmith096 well its working now so yay
05:04 🔗 bsmith096 it's running beautifully
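The "Could not open WARC file" error earlier fits what Coderjoe says at 05:02: wget creates the .warc.gz itself, but not the directory it is told to put it in. A minimal sketch with the paths from this job (directory layout illustrative):
    # make sure the directory for the WARC exists before starting
    mkdir -p www.fanfiction.net
    wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
        --warc-cdx --warc-file=www.fanfiction.net/fanfiction_20111122 www.fanfiction.net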
05:37 🔗 chronomex okay guys.
05:38 🔗 chronomex http://colour-recovery.wikispaces.com/Full+gamut+colour+recovery these guys are working on recovering color tv
05:38 🔗 chronomex starting with scans of film that was exposed in a machine that records tv onto film
05:38 🔗 chronomex they're re-rectangularizing the frames and extracting color from the dot crawl you get on b/w tv
05:38 🔗 chronomex it's fucking impressive.
05:43 🔗 SketchCow I downloaded all the the ROFLcon summit videos.
05:43 🔗 SketchCow Including myself.
05:43 🔗 SketchCow http://www.archive.org/details/roflconsummit-cpw
05:52 🔗 underscor Brewster and you?!
05:55 🔗 Coderjoe chronomex: someone was telling me about that, restoring color to old kinescopes of doctor who
05:55 🔗 chronomex yeah
05:55 🔗 chronomex it's cool shit.
05:56 🔗 chronomex modern signal processing is kind of magic
06:02 🔗 Coderjoe i still mildly dislike the old heads at the beeb. the ones that decided to destroy their archives.
06:02 🔗 chronomex they had reasons
06:11 🔗 SketchCow Oh, don't justify them.
06:27 🔗 chronomex I didn't say they were good reasons
06:27 🔗 chronomex but they didn't just say HURRDURR BALEET
06:37 🔗 SketchCow Someone did
06:37 🔗 SketchCow I believe it was TALLY BALEET
06:37 🔗 db48x lol
06:45 🔗 chronomex this is horrible. I'm moving into a smaller space and have to winnow :(
06:45 🔗 SketchCow Noooooo
06:46 🔗 chronomex is why I have some understanding for what happened at the BBC
06:46 🔗 chronomex mostly I have a bunch of paper detritus
06:55 🔗 arima SketchCow: can I get an rsync slot?
07:28 🔗 Nemo_bis what does it mean if I have over 110 instances of the splinder downloader running but only 70 instances of wget?
07:29 🔗 Nemo_bis there can't be 40 in the process of parsing the profile pages because there's no such python process currently
07:45 🔗 Coderjoe Nemo_bis: how high is your load avg? what does your memory usage look like? there can be a number trying to parse if there is heavy disk IO, or you're short on ram and not able to cache stuff for long, or swapping, etc
07:50 🔗 dnova this is funny; we're down to the bigger profiles (not counting the ones alard has not put back into the queue)
07:54 🔗 Coderjoe hah
07:54 🔗 Coderjoe it:MagicaHermione
07:56 🔗 Nemo_bis Coderjoe, I'm using all memory and disk load is very high, but there's nothing to parse, it's been downloading the same big users for days
07:58 🔗 dnova there seem to be a small number of extreme splinder users
07:58 🔗 Coderjoe weird
07:59 🔗 Coderjoe just had the same downloader-user pair show up twice on the dashboard
07:59 🔗 dnova yeah I saw that too!
07:59 🔗 dnova Nemo it:Luley 0MB
07:59 🔗 dnova Nemo it:Luley 0MB
07:59 🔗 Nemo_bis hm
08:00 🔗 Nemo_bis I was just looking at that user
08:01 🔗 Nemo_bis uh, there's some user being downloaded at 10 Mb/s, what a joy
08:04 🔗 Nemo_bis machine is too slow to open browser now, sorry if I flood the channel
08:04 🔗 Nemo_bis - Downloading profile HTML pages... done.
08:04 🔗 Nemo_bis - Parsing profile HTML to extract media urls... done.
08:04 🔗 Nemo_bis Deleting incomplete result for it:Luley
08:04 🔗 Nemo_bis Downloading it:Luley profile
08:04 🔗 Nemo_bis it:Luley contains 502 or 504 errors, needs to be fixed.
08:04 🔗 Nemo_bis - Downloading 4 media files... done, with HTTP errors.
08:04 🔗 Nemo_bis - Checking for important 502, 504 errors... none found.
08:04 🔗 Nemo_bis - Result: 134K
08:04 🔗 Nemo_bis rm: cannot remove "data/it/L/Lu/Lul/Luley/.incomplete": No such file or directory
08:04 🔗 Nemo_bis Telling tracker that 'it:Luley' is done.
08:04 🔗 Nemo_bis blogs.txt media-urls.txt splinder.com-Luley-html.warc.gz splinder.com-Luley-media.warc.gz wget-phase-1.log wget-phase-2.log
08:04 🔗 Nemo_bis ls data/it/L/Lu/Lul/Luley
08:05 🔗 Coderjoe you happened to have two threads doing the same user? friggin awesome :-\
08:05 🔗 Nemo_bis no, I don't think so
08:05 🔗 Coderjoe it would explain the double-complete and that rm error
08:05 🔗 Nemo_bis I think that for some reason it marked it complete two times, the second failing to remove the .incomplete (because already deleted) but telling the tracker two times
08:06 🔗 Coderjoe (one thread deleted the .incomplete file before the other)
08:06 🔗 Nemo_bis perhaps due to disk load it queued the "remove and tell to tracker" two times? (I don't know how it works)
08:07 🔗 Coderjoe no, it wouldn't have done that
08:07 🔗 Nemo_bis ah, you're right
08:07 🔗 Coderjoe unless there were two dld-single.sh instances working on the same user at the same time
08:07 🔗 Nemo_bis hm, this shouldn't happen
08:09 🔗 Coderjoe no idea how that would have happened unless you stopped dld-streamer.sh with ^c and didn't actually stop all the children, and then started downloaders through another means
08:09 🔗 Nemo_bis no, I'm running two instances of fix-dld
08:09 🔗 Coderjoe oh
08:09 🔗 Coderjoe don't do that
08:09 🔗 Nemo_bis because there are too many users to fix and it's very slow, and they shouldn't conflict
08:10 🔗 Coderjoe they will conflict
08:10 🔗 Nemo_bis but the first which finds a user adds .incomplete and the second ignores it
08:10 🔗 Nemo_bis it's always worked :-/
08:11 🔗 Coderjoe only because you were lucky before on the race conditions
08:12 🔗 Nemo_bis but how can they conflict, doesn't the fixer add the .incomplete mark as soon as it starts working on a user?
08:12 🔗 Coderjoe yes, but if the one got past that check just before the other makes the file, they will BOTH think they can work on that user
08:12 🔗 Coderjoe like I said, RACE CONDITIONS
08:13 🔗 Coderjoe checking for the file and creating the file are not atomic operations
08:13 🔗 Nemo_bis yes, I understand, but... I thought that writing an empty file wouldn't be a problem. It happened 4 times though, stopping now...
08:14 🔗 Nemo_bis Before us.splinder.com went down I was doing one for it users and one for us user
08:15 🔗 Coderjoe fix-dld doesn't have a means of specifying which country to work on
08:16 🔗 Nemo_bis I just modify the script
08:17 🔗 Coderjoe if you look at the script, there is a rather large time gap between checking for .incomplete and creating it
08:17 🔗 Coderjoe (the grep)
08:18 🔗 Nemo_bis ah
08:19 🔗 Nemo_bis indeed
08:20 🔗 Nemo_bis although this way it checks for way more .incomplete files than needed and saves just a few greppings
08:21 🔗 Coderjoe eh? I can't quite parse that. (I should also probably go to bed soon as well)
08:24 🔗 Coderjoe if you wanted to make it more parallel, the way to go about it would be to write out the list of need-redo users to a file and then (carefully) modify dld-streamer.sh to read from that file instead of asking the tracker. (but do not run that dld-streamer until your fix script is done identifying all users to redo)
08:25 🔗 Coderjoe (I think this approach is a bit easier and less error prone than trying to bring in all the job control logic from dld-streamer into fix-dld)
08:29 🔗 Nemo_bis yes, but I don't have *that* many users to fix
08:30 🔗 Coderjoe right, but you wanted to run the downloads in parallel. fix-dld was not written to support running multiple instances on the same data directory at the same time
08:30 🔗 Nemo_bis what I meant is that you usually put the less expensive and more effective condition first, so it makes sense that fix-dld checks for .incomplete as the first thing, but it's not actually needed if the user doesn't need to be fixed
08:31 🔗 Nemo_bis yep, I just gave up, I'll be patient :-p
08:31 🔗 Coderjoe the check for .incomplete was to allow fix to be run while downloaders are also running
08:31 🔗 Coderjoe (as stated in the comment above the check)
08:31 🔗 Nemo_bis so perhaps it could be more efficient to check for the .incomplete after grep
08:31 🔗 Nemo_bis yes, I meant ^
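For the race Coderjoe describes (a gap between checking for .incomplete and creating it), the usual shell remedy is an operation that checks and claims in a single step; mkdir is atomic in that sense. This is a generic sketch with made-up path and lock names, not the actual fix-dld code:
    userdir="data/it/L/Lu/Lul/Luley"      # illustrative user directory
    if mkdir "$userdir/.fixing" 2>/dev/null; then
        # only one process can succeed in creating the lock directory,
        # so two fix scripts can no longer both claim the same user
        #   ... re-check and re-download this user here ...
        rmdir "$userdir/.fixing"
    else
        echo "another process is already fixing $userdir, skipping"
    fi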
08:32 🔗 Schbirid is there a way to create a new git repo and upload files using the website at github?
08:32 🔗 Schbirid wait nevermind
08:32 🔗 Schbirid i forgot i published it already ( https://github.com/SpiritQuaddicted/sourceforge-file-download ) :D
08:35 🔗 Schbirid hm, moddb might be a worthy target
08:44 🔗 SketchCow http://ascii.textfiles.com/archives/3395
08:47 🔗 chronomex SketchCow: you left his mail address in once, dunno if that was intentional
08:49 🔗 SketchCow Un.
08:49 🔗 SketchCow Fixing.
08:50 🔗 SketchCow Fixed.
08:50 🔗 * Schbirid posts unredacted version to archive.org
08:50 🔗 Schbirid just kidding
08:52 🔗 chronomex hm. google suggests that someone with that email address is into penny stocks
08:52 🔗 chronomex interesting.
08:53 🔗 Nemo_bis this makes me think of some emails in my inbox... :"(
08:53 🔗 Nemo_bis oh well, gotta go
08:54 🔗 chronomex Once I sent an email in 2004 and got a response in 2009.
08:54 🔗 chronomex These things happen.
08:54 🔗 chronomex actually 2008
08:54 🔗 Schbirid those things rock
08:54 🔗 chronomex so I waited to 2011 to reply back.
08:55 🔗 Schbirid sounds like my astonishing talent of conversation with girls i fancy
08:55 🔗 chronomex ?
08:56 🔗 Schbirid gets hit on in 2004, realises in 2008
08:56 🔗 Schbirid :D
08:56 🔗 chronomex smrt.
08:56 🔗 SketchCow Sleep with in 1996, realize she was using you in 2007
09:01 🔗 Schbirid )
09:01 🔗 Schbirid :)
09:02 🔗 SketchCow This was just the therapy to be able to reach some sort of thing with this guy.
09:02 🔗 SketchCow Send him his drive back.
09:02 🔗 SketchCow be done with it.
09:02 🔗 SketchCow Now, he perceives a crime worthy of monetary damages.
09:02 🔗 SketchCow That's interesting.
09:02 🔗 SketchCow I'd like to see that court case.
09:11 🔗 Schbirid opera makes it much too easy to collect too many open tabs over the week
11:41 🔗 emijrp google knol is closing
11:43 🔗 db48x indeed
11:43 🔗 db48x can we save it?
11:43 🔗 emijrp DUDE.
11:45 🔗 db48x hmm
11:45 🔗 db48x firefox just crashed
11:45 🔗 db48x it doesn't usually do that
11:45 🔗 emijrp http://googleblog.blogspot.com/2011/11/more-spring-cleaning-out-of-season.html
11:48 🔗 emijrp Google is now on the Archive Team black list.
11:51 🔗 db48x well, we have five months
11:52 🔗 chronomex google joins yahoo in the Bad Decision Club
12:15 🔗 ZoeB hi! does anyone know how to get curl to play nicely with Yahoo!'s login screen?
12:16 🔗 chronomex hm. yahoo's login screen barely works for me in a real web browser :P
12:16 🔗 ZoeB I've got as far as trying the following:
12:16 🔗 ZoeB curl -c cookie.txt -d "login=zoeblade&passwd=foo&submit=Sign In" https://login.yahoo.com/config/login
12:16 🔗 ZoeB curl -b cookie.txt -A "Mozilla/4.0" -O http://launch.groups.yahoo.com/group/foo/message/[1-23217]?source=1
12:16 🔗 chronomex I'd suggest looking at the program 'fetchyahoo' which downloads mail from a yahoo account
12:16 🔗 chronomex that has some rather robust login code
12:17 🔗 ZoeB but although it does save a cookies file, it keeps on trying to redirect me to the login screen, so presumably I haven't logged in correctly
12:17 🔗 chronomex or it did when I used it few years ago :]
12:17 🔗 ZoeB good idea, thanks
12:17 🔗 chronomex hm.
12:17 🔗 chronomex good luck! it's 4am and I should have been in bed hours ago :|
12:23 🔗 ZoeB sweet dreams ^.^
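Login forms like Yahoo's usually set hidden per-session fields on the login page (a challenge/CSRF token), so posting only login and passwd, as above, tends to bounce straight back to the login screen. A generic sketch of the usual workaround; the .challenge field name is a guess for illustration, not Yahoo's documented form:
    # 1. fetch the login page with a cookie jar, keeping the HTML to read hidden inputs from
    curl -c cookie.txt -o login_page.html https://login.yahoo.com/config/login
    # 2. inspect login_page.html for hidden <input> fields and their values
    # 3. post the form including those fields, following redirects and updating cookies
    curl -b cookie.txt -c cookie.txt -L \
        --data-urlencode "login=zoeblade" \
        --data-urlencode "passwd=foo" \
        --data-urlencode ".challenge=VALUE_FROM_STEP_2" \
        https://login.yahoo.com/config/login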
12:32 🔗 Hydriz Hey people
12:37 🔗 emijrp hi
12:37 🔗 db48x hello Hydriz
12:39 🔗 Hydriz How is the day?
12:39 🔗 Hydriz Just seen the Knol news
12:41 🔗 db48x the day is early
12:42 🔗 Hydriz lol
12:43 🔗 Hydriz just feel like jumping in and start archiving Knol
12:43 🔗 Hydriz just love the archiving feeling
12:46 🔗 db48x great
12:46 🔗 db48x the place to start is by exploring the site and seeing how it's organized
12:46 🔗 db48x can we download things by user, or by some other enumerable index?
12:50 🔗 Hydriz looks like it is sorted by user?
12:50 🔗 Hydriz or maybe...
12:51 🔗 Hydriz Lets take an example: http://knol.google.com/k/scott-jenson/scott-jenson/6b7e08nms1ct/0#knols shows knols the user created
12:52 🔗 db48x excellent
12:59 🔗 Hydriz are you writing the script now?
13:02 🔗 db48x no; I'm far too tired to be of any use there
13:03 🔗 Hydriz LOL
13:44 🔗 ersi Funny thing about the Knols he wrote is that they've been viewed strangely
13:44 🔗 ersi 27k times, 3k, 5k
13:44 🔗 db48x nah, google estimates on that sort of thing all the time
13:45 🔗 db48x number of search results, word counts in books, etc
13:45 🔗 ersi those bastards
13:45 🔗 db48x all estimates posing as real numbers
16:56 🔗 SketchCow HEY IT IS THE FAMOUS ZOE BLADE
18:01 🔗 closure SketchCow: hey, I think you left out your own talk on http://www.archive.org/details/roflcon-summit
18:03 🔗 PatC yay, a download!
18:06 🔗 SketchCow It's there, but you have to click on "all"
18:10 🔗 closure doh! one day this will have a better interface, I'm sure
18:14 🔗 closure on the plus side, that's another successful transaction, and I didn't even blackmail you
18:40 🔗 ersi hah
18:48 🔗 Coderjoe wow
18:49 🔗 Coderjoe I just finished a 243MB user
18:52 🔗 closure I have users who have been going for 3 days now.
18:54 🔗 Coderjoe hmmm
18:55 🔗 Coderjoe you know what might be handy? a ramfs/tmpfs-disk hybrid filesystem
18:56 🔗 Coderjoe you set a maximum ram usage and it keeps recently used/accessed stuff only in ram. when that ram space is exceeded, stuff that hasn't been touched in awhile gets flushed out to a backing directory on a hard drive
18:58 🔗 Coderjoe this way, you get the benefits of tmpfs/ramfs for the files that get downloaded and grepped, but don't potentially fill it up with additional stuff that is downloaded and then not touched
19:01 🔗 Coderjoe and is better than just the disk cache over a normal filesystem because stuff doesn't even hit rotational media until it gets evicted from ram
19:09 🔗 bsmith094 i just got back, google knol is going away?
19:12 🔗 db48x2 yea, I had 26 users still going when I left this morning
19:12 🔗 db48x2 bsmith094: yep
19:13 🔗 bsmith094 so is splinder done yet? fully?
19:13 🔗 db48x2 nope
19:13 🔗 db48x2 us.splinder.com is still down
19:13 🔗 db48x2 so those users are out of circulation
19:13 🔗 bsmith094 not done, because the tracker for the streamer script is down
19:13 🔗 bsmith094 ??
19:14 🔗 db48x2 and we all need to check our data for incomplete or broken users and finish them
19:14 🔗 db48x2 bsmith094: the tracker isn't handing out any us users, so the script will see it as done
19:16 🔗 bsmith094 ok i just spent 14 hrs doing a wget of fanfiction.net, and why do i have blah and blah.orig
19:16 🔗 db48x2 cool. how big is the result?
19:17 🔗 marenostr Hi dear friends! Can anyone of you with Windows operating system and Internet Explorer web browser help me just for a few minutes? I'm dealing with a bug in a file of Project Gutenberg and I don't use Windows/IE, I'm on a GNU/Linux box and online IE rendering tools is not helpful for my case. I want to learn what you see on page www.gutenberg.org/dirs/etext06/8loc110h.htm under the expression "digestibility will be acquired." (without quotes). An
19:17 🔗 bsmith094 du -ach isnt done yet
19:17 🔗 marenostr image? Or some text? What? Thanks in advance!
19:17 🔗 db48x2 bsmith094: :)
19:17 🔗 bsmith094 so whats the .orig files and can i get rid of them?
19:18 🔗 db48x2 marenostr: this isn't really the right channel for that kind of question
19:18 🔗 db48x2 bsmith094: dunno. check your wget log and see when it created them
19:18 🔗 db48x2 maybe the server just had files with that name on it
19:18 🔗 bsmith094 where would the log be
19:18 🔗 db48x2 where did you save it?
19:18 🔗 marenostr db48x, OK. Sorry. Googling gave me that impression. Sorry.
19:19 🔗 bsmith094 wget www.fanfiction.net
19:19 🔗 db48x2 hrm
19:19 🔗 db48x2 marenostr: what specifically did you see that gave you this impression?
19:19 🔗 db48x2 bsmith094: did you specify a -o or -a option?
19:23 🔗 marenostr db48x, On page http://archiveteam.org/index.php?title=Project_Gutenberg , right side, at the bottom of the box says. IRC channel: #archiveteam and this is meant for Project Gutenberg. It gave me that impression.
19:26 🔗 bsmith094 db48x2: i used this wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/fanfiction_20111122 www.fanfiction.net
19:26 🔗 db48x2 bsmith094: so wget sent its log to stdout
19:27 🔗 db48x2 bsmith094: did you redirect stdout to a file?
19:27 🔗 db48x2 for example: wget ... www.fanfiction.net > archive.log
19:27 🔗 bsmith094 errr... no, but in hindsight that would have been a good idea
19:27 🔗 db48x2 indeed :)
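One detail worth noting for next time: wget writes its progress messages to stderr rather than stdout, so to capture a log either redirect stderr as well or use wget's own -o/-a logging options (both standard wget flags; "..." stands for the rest of the options):
    # capture everything the run prints
    wget ... www.fanfiction.net > archive.log 2>&1
    # or have wget write (-o) or append (-a) the log itself
    wget -o archive.log ... www.fanfiction.net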
19:27 🔗 PatC Evening folks
19:28 🔗 db48x2 marenostr: ah. Archive Team is all about archiving webpages, and one of the webpages we have archived/are archiving is Project Gutenberg :)
19:28 🔗 db48x2 PatC: howdy
19:28 🔗 bsmith094 to give an idea of how large this site is for being mostly text, the warc file is 850mb
19:31 🔗 bsmith094 doesnt PG have a problem with robots bulk downloading
19:33 🔗 bsmith094 my disk currently hates me, so im gonna let du -ach run afor a while
19:33 🔗 PatC Pulling 2MB/s off archive.org I didn't know their internet connection was this good / not limited
19:34 🔗 ndurner Good/unlimited internet connections are 2 GB/s nowadays
19:35 🔗 PatC wow!
19:39 🔗 Coderjoe bsmith094: if you used -k and -K the -K means back up the original files before changing the links. (which isn't really needed if you were doing a warc as well, since the warc would have the original data the server sent)
19:40 🔗 Coderjoe also, if you let wget run to completion, a copy of the log should be in the warc file
19:40 🔗 db48x2 ah, right, -K
19:44 🔗 Coderjoe haha
19:45 🔗 Coderjoe that's mildly amusing
19:45 🔗 Coderjoe google had already indexed the blog post before it was revised to remove the email address
19:50 🔗 db48x2 heh
20:24 🔗 bsmith094 how do i rsync my splinder data
20:25 🔗 bsmith094 never mind got it
20:29 🔗 alard Hi people; Splinder seems to be back to normal. Is it time to requeue the leftovers?
20:34 🔗 closure leftovers? after thanksgiving
20:34 🔗 closure srsly, if you want to queue some stuff, I'm game
20:39 🔗 DFJustin <db48x2> and we all need to check our data for incomplete or broken users and finish them
20:39 🔗 DFJustin how does one go about this
20:40 🔗 bsmith094 i would like to know as well. is there a script for that? :)
20:46 🔗 bsmith094 ive noticed a file in the splinder grab, called error-usernames, how would i pipe that into dld-single?
21:34 🔗 emijrp knol guys, knol
21:35 🔗 PatC What's that?
21:38 🔗 emijrp Google Knol.
21:40 🔗 SketchCow Yeah
21:40 🔗 SketchCow Saw you talking about it.
21:41 🔗 SketchCow Let's requeue splinder, get that done.
21:43 🔗 alard SketchCow / others: I put the splinder items back into the queue about 1 hour ago. 254466 to do. (And the site is sluggish again, perhaps there is a correlation?)
21:45 🔗 bsmith093 trackers back up ... downloading...
21:47 🔗 bsmith093 hey cool im acutally on the board
21:50 🔗 db48x2 bsmith093: :)
22:03 🔗 SketchCow Good deal.
22:03 🔗 SketchCow Now, help me here, though.
22:03 🔗 SketchCow That means people have given me users that are, in fact, empty, and that someone will give me different users.
22:07 🔗 alard Yes, so don't mix them up. :)
22:07 🔗 alard (Although at this point I've requeued items that were never marked done, so probably those are still unfinished and you won't get them.)
22:11 🔗 SketchCow We'll have to do some level of checking.
22:12 🔗 SketchCow By the way, dragan is completely redoing/cleaning the Geocities torrent.
22:12 🔗 closure SketchCow: did you see the Berlios cleanup script I wrote you?
22:13 🔗 SketchCow Yes, but you'll need to e-mail me details.
22:14 🔗 closure comments at the top should explain it all
22:21 🔗 Coderjoe I just made some changes to dld-streamer.sh (and tested them)
22:21 🔗 Coderjoe there is now an optional third parameter. if given, it is the name of a file to read usernames from, rather than asking the tracker.
22:22 🔗 Coderjoe this is to allow the use of the job management of dld-streamer.sh to retry previously-failed usernames.
22:23 🔗 Coderjoe but I am going to guess a large number of those have been moved back into the todo list on the tracker
22:25 🔗 bsmith093 Coderjoe: thats exactly what i was asking about thanks!
22:25 🔗 bsmith093 error-usernames is a list
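A minimal sketch of replaying such a username list one entry at a time with dld-single.sh; the argument order (tracker nickname, then username) is an assumption about the splinder-grab scripts, so check the script header before relying on it:
    # hypothetical loop, assuming dld-single.sh takes <yournick> <username>
    while read -r user; do
        ./dld-single.sh yournickhere "$user"
    done < error-usernames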
22:39 🔗 emijrp how can i know which hard disk brand goes inside a verbatim 2TB drive?
22:40 🔗 Coderjoe crack it open, or google the model and see if anyone else has cracked it open?
22:43 🔗 amerrykan semi-ontopic for this channel: Someone found the original workprint to "Manos: The Hands of Fate" and intends to restore it - http://forums.somethingawful.com/showthread.php?threadid=3450845
22:43 🔗 amerrykan http://www.manosinhd.com/
22:48 🔗 Coderjoe ...
