[00:08] Syncronet is still being developed. I re-learned this in the past few days while reminiscing about my old BBS software ( http://tech.groups.yahoo.com/group/KBBS/ ) [00:08] Err, clients. Whoops. [00:09] I figured people just used telnet. [00:14] if net connected, usually just telnet or the like [00:14] (unless you want to connect to a system with RIP or some custom graphical client program) [00:15] (mmmm RIPTERM... something I only used twice or so) [00:20] still need phone #s [00:20] \?? [00:21] no idea where to get current numbers [00:24] are there current numbers? [00:24] how would it work with telnet [00:24] you need to know a hostname or ip address to telnet to [00:25] miku.acm.uiuc.edu [00:27] you can also use dosbox to redirect telnet to an emulated COM port and use old DOS terminal software [00:28] hahahah [00:29] "taking a moon lander out to do riceboy drifting out in the parking lot" [00:30] oh it's even easier than I thought in dosbox, you just dial an IP address instead of a phone number [00:32] Thank you, rude___ [00:37] Coderjoe: haha very funny [00:39] nyancat or whtatever its called [00:41] although nice trick, i didnt even know the terminal could recieve color [00:41] xterm-color, ANSI, etc [00:42] SketchCow: anything besides the insanely huge mobilme to archive, something smaller? perhaps fanfiction.net? would love to help with that :) tried wget -mcpk, didnt get all of it though, weird [00:44] even with the useragent workaround, stopped at 300Kfiles, and i know there are at least 2million stories [00:44] Coderjoe: did you see that episode of Top Gear where they actually drove the real moon buggy? [00:44] no. [00:45] i want to now [00:45] I recommend it :) [00:45] it's the one they developed for a future moon mission, that may or may not ever be used [00:45] it's got a pressurized cabin [00:45] 6-wheel drive [00:45] theres another moon mission?! [00:45] full independant suspension [00:46] i want one ! [00:46] coolest...moon buggy...ever!! [00:46] each wheel is independantly steerable [00:46] the console inside gives you diagnosics on each wheel, showing you how much power is being applied and so on [00:47] wait... why independently? [00:47] wouldnt you usually be pointing them all in roughly the same direction at any given time? [00:47] the wheels might not all be touching the ground at the same time [00:47] they have a lot of vertical travel so that you can go over rocks [00:47] yea, in general [00:47] oh.. yeah... duh moon grav.. wow i feel stupid [00:47] but you might want to spin in place [00:48] like donut, or spin actually in place [00:48] someone hasn't seen things like zero-point-turn commercial mowers and stuff [00:48] (though those go by a different means, like tanks) [00:48] ZERO POINT TURN?!?? for a LAWN MOWER!?! why? [00:48] yea [00:49] heh [00:49] commercial mowers. so they can mow a field faster and get more jobs in during a day [00:49] thats like, i have a sleep disorder, oh heres a TIME MACHINE! [00:49] lol, great reference [00:49] oh well commercial mowers, well that acually makes sense [00:49] hmm, think chapter 78 is up yet? [00:49] more jobs means they can get mor income [00:50] yeah, hpmor rocks [00:51] http://www.topgear.com/uk/photos/topgear-moon-drive?imageNo=1 [00:51] heres all of it on one convenient file [00:52] I have it on my phone as an ebook too [00:52] neat [00:52] eh, what is this? [00:52] vague much [00:52] Coderjoe: Harry Potter and the Methods of Rationality? [00:53] oh, my bad. 
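For the DOSBox tip above (redirecting a terminal program's COM port to telnet), a minimal dosbox.conf sketch; the section and option names are from memory of DOSBox's softmodem emulation, and the BBS hostname is a placeholder:

```
[serial]
serial1=modem

# Inside the DOS terminal program, "dial" a telnet BBS directly
# with the Hayes dial command, e.g.:
#   ATDT bbs.example.org:23
```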
it's 12-wheel drive [00:53] 6 pairs of wheels [00:53] no, i meant the "hpmor" [00:54] yea, HPMoR, Harry Potter and the Methods of Rationality [00:54] ACRONYMS ARE U=YOUR FRIEND [00:54] stupid caps [00:54] Coderjoe: http://www.fanfiction.net/s/5782108/1/Harry_Potter_and_the_Methods_of_Rationality [00:57] Coderjoe: I recommend it even if you don't generally like fan fiction, but be forewarned: your laughter will annoy your housemates/coworkers [01:00] every story has its own unique id number , they are apparently sequential, hey come to think of it i have a fanfiction downloader, that takes link lists [01:01] ill just generate all possible story ids and plug them into that [01:02] :) [01:03] integrate it with the scripts we used for splinder/mobileme/anyhub [01:03] then we can all help out [01:07] not really sure how, what im doing ( or trying to do) will just grab all the stories, ( hopefully) check link by link and download into a text file by category author and storyname, using this little binary blob linux app i found here, fanfictiondownloader.net [01:12] ok well my generator is choking, trying to pump out 10million links at once, so basically whats the command to generate these http://www.fanfiction.net/s/[0000000-9999999]/1/ note the regex im trying to use [01:12] 0 to 9 999 999 [01:13] probably nowhere near that many storeies but the id's are all over the place [01:15] wow. I never knew there was a printer acessory for the game boy [01:15] heh [01:15] (and I had a 1st gen game boy) [01:18] there was a printer available for my favorite calculator [01:20] bsmith093: this is one way, but it takes forever: for i in {0000000..9999999}; do echo $i >> file.txt; done [01:21] actually [01:22] bsmith093: echo {0000000..9999999} | tr ' ' '\n' > file.txt [01:22] should be a lot faster [01:24] ok but with the links around the numbers [01:24] yeah. took a little less than a minute just now. generated a 70 megabyte file [01:24] oh [01:24] sorry this things very p0icky that way [01:24] yeah sure [01:25] i can run it from here though :) once i have the command, or if i can ever figure out regex like this for myself :) [01:26] bsmith093: i dunno if generating the numbers beforehand is faster but, this is what you'd run after that last one (echo | tr thing): while read $num; do echo http://www.fanfiction.net/s/$num/1/ > linklist.txt; done < file.txt [01:26] i swear every script i've ever seen thats more complicated that wget her and grab this put it there, looks like chinese to me [01:26] oh. well. there's that if you want/need it for inspiration [01:26] yeah bash on a single line isn't too friendly [01:26] thanks [01:26] keep in mind that /1/ needs to be incremented as well until you run out of chapters [01:27] oh, if so then the linklist would just have chapter1 for everything [01:27] how many chapters should we look for for each? [01:28] er [01:28] bsmith093: the command should actually have "> linklist.txt" at the end: while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt [01:29] number of chapters depends on the story [01:29] on my end its still generating the file of numbers [01:30] bsmith093: the "echo {0000000..9999999} | tr ' ' '\n' > file.txt" ? [01:30] yes [01:30] ah, ok [01:30] holy crap. 
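For reference, the ID-generation step just suggested, with comments (same commands as in the log, nothing new):

```
# Generate all 10 million zero-padded story IDs, one per line.
# The brace expansion emits every 7-digit number separated by spaces;
# tr converts the spaces to newlines.
echo {0000000..9999999} | tr ' ' '\n' > file.txt

# Sanity check: 10,000,000 IDs x (7 digits + newline) = 80,000,000 bytes (~77 MB)
wc -c file.txt
```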
VASTLY different youtube front page, too [01:30] file should be around 77MB so you can track its progress looking at that [01:30] bsmith093 [01:31] Coderjoe: yeah but similar to how some ids will not go to real stories, one must kind of pick a default for how many chapters to look for. stopping trying to download more than 4 if say chapter 4 isn't found would be good, but that requires tool support [01:32] no prob SketchCow.. got more coming your way soon [01:32] did comcast start speeding up download speeds? [01:32] i only ask cause i have 800kbytes down [01:33] googling found this http://bashscripts.org/forum/viewtopic.php?f=8&t=1081 [01:33] godane: in some areas they increase the dl speed. people usually get an email [01:33] the thing i have at fanfictiondownloader will auto download if it has more than one chapter, i just have to find out if it wwill continure upon find ing an invalid is [01:34] oh [01:34] is sorry this is really slowing down my laptop [01:34] *id* oy vey typos [01:34] bsmith093: yeah. maxed out my computer for a little bit. you can renice it and it'll go slower but not take over so much [01:37] new frontpage: http://i.imgur.com/RPw6K.png [01:38] Coderjoe, yep :/ [01:38] wow [01:39] that's quite a change [01:41] how do i track a files changes in realtime cli [01:42] when another process is editing it [01:45] bsmith093: "tail -f file.txt" will output lines getting added to a file. but what i'd do if i were you is "watch ls -lh" in the directory you're generating the txt [01:45] watch reruns a command, by default every 2 seconds, so you can see how big it's getting [01:46] tail -f might slow it down is why i say ls over tail [01:48] its at 270MB and rising [01:48] linklist [01:48] er [01:48] oh [01:48] i thought for a sec you meant file.txt, heh that'd be way too big [01:59] btw seems the overall count is 10^7 - 1 [01:59] for ease of notation [02:00] that finally completed linklist is full of these http://www.fanfiction.net/s//1/ [02:00] inbetween the double slashes is where the id gpes [02:01] sorry minor glitch there, and i cant see why [02:01] stopped growing and completed at 306mb [02:01] hmmm [02:01] bsmith093: are you on ubuntu or osx? [02:01] ubuntu [02:02] lucid 10.04 32bit [02:03] where was i suppoosed to run the while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt casue i just ran it in the terminal, like the generator command earlier [02:03] "I can't believe my dress ripped. They saw everything! Even my Ranma panties. They change color when I get wet." [02:04] bsmith093: ah yeah i'm getting the same result. sorry about that [02:04] wait how does it know where $num is? 
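The two monitoring suggestions above as copy-paste lines (watch's 2-second interval is its default):

```
# Re-run ls every 2 seconds to see the output file grow (cheap):
watch -n 2 ls -lh

# Or stream lines as they are appended (heavier while the generator runs):
tail -f linklist.txt
```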
[02:05] bsmith093: the "while read $num" is supposed to operate on each line of the file piped in, which is "< file.txt" [02:05] oh, see this is why im going to be taking a sed and bash scripting class in college [02:07] errr [02:07] bsmith093: "while read num" not "while read $num" [02:07] um ok then re running now [02:07] while read num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt [02:07] all the same except for that first part [02:08] sorry about that [02:09] running perfectly now just gotta wait again [02:09] meantime lets see what my actual downloader will do to an invalid id [02:12] bsmith093: good idea [02:13] bsmith093: btw by my calculations the resulting file should be 80000000 + (10^7-1) * 31 bytes or 371.932954 megabytes (according to google) [02:14] http://www.google.com/search?q=(80000000+%2B+(10^7-1)+*+31)+bytes+to+megabytes [02:16] well it works great if the link is valid, other wise it dies [02:17] althought there is always the scripts its based on.. hold o0n a min [02:19] here grab this http://fanficdownloader.googlecode.com/files/fanficdownloader-4.0.6.zip [02:20] ahh python [02:22] ok now go in and take a look at downloader.py apparently the only thing it CANT do is read links from a file [02:23] heh [02:23] is there a pipe for that? [02:23] welll if i knew python it shouldn't be too hard to add that functionality [02:23] oh [02:23] i'd do a bash "while read" [02:24] http://boingboing.net/2011/12/01/your-tax-dollars-at-work-misl.html [02:24] bsmith093: while read link; do python downloader.py $link; done < linklist.txt [02:24] er [02:24] but if you want .html you have to specify that manually it says [02:25] so [02:25] bsmith093: while read link; do python downloader.py -f html $link; done < linklist.txt [02:25] bah [02:25] i'd try with 3-5 links before running it on linklist [02:25] python is simple [02:26] Coderjoe: not for someone with a huge mental block against learning things in one sitting. i've been meaning to learn it for like a years now. ;( [02:26] what programming languages do you know? [02:26] yeah apparently this script was heavily modified to make the binary blob i found, but he did say that, so,... anyway this one wants full urls, not just nice sequential ids [02:27] (and a real programmer should be able to figure out other languages of the same type fairly easily) [02:27] fanficdownloader.net is there any way to see inside a linux blob [02:28] the source is all in the zip file [02:29] Coderjoe: bash and ti basic [02:30] fanficdownloader.net isn't loading for me [02:30] yes but i cant really read python so if uve got it go here fanficdownloader-4.0.6/fanficdownloader/adapters/adapter_fanfictionnet.py [02:30] bsmith093: what's the difference between the links in linklist and "full urls"? [02:30] default format of the fanficdownloader python in a zip file is epub [02:30] fanfictiondownloader.net [02:31] not fanfic [02:31] it can also do html or txt [02:31] oh, help for it seems to say just epub or html [02:31] derp, nvm. "text or html" [02:32] though I would prefer to call out to wget-warc or something else that packs a warc [02:32] linklist has these http://www.fanfiction.net/s/5192986 [02:32] the script currently wants these http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo [02:32] even though it splices out the story id anyway?! [02:33] yeah it shouldn't need those [02:33] bsmith093: wait so it complains if you don't put in the story id? 
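Putting the corrected pieces together, a sketch of the link-list step ("read num", not "read $num", with the redirect placed after done):

```
#!/bin/bash
# Turn the ID list into chapter-1 URLs, one per line.
while read num; do
    echo "http://www.fanfiction.net/s/$num/1/"
done < file.txt > linklist.txt

# Expected size: 8 bytes per input line plus 31 bytes of URL text,
# times 10 million lines -- roughly 372 MB, matching the estimate above.
```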
[02:33] no it complains if you leave off the title like this http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo [02:34] the thing after the id it wants that [02:34] which is arbitrary, and not at all sequential [02:34] bsmith093: try putting a placeholder thing there. like "foo" [02:34] if it just strips it [02:35] npe chokes [02:35] it reads the full url and loads it [02:35] so that wont work [02:36] fanficdownloader-4.0.6/fanficdownloader/adapters/base_adapter.py", line 166, in _fetchUrl [02:36] raise(excpt) [02:38] hmm [02:46] bsmith093: do you want epub or html? [02:46] i suppose for future proffinging purposes not to mention formatting html would be best [02:47] ah [02:47] well [02:47] i'm not getting the error you're getting for some reason [02:48] all of these link formats work for me with fanficdownloader-4.0.6: http://www.fanfiction.net/s/5192986/ http://www.fanfiction.net/s/5192986/1/ http://www.fanfiction.net/s/5192986/1/A [02:49] hmmm... [02:49] i'm doing this basically python /home/arrith/bin/fanficdownloader-4.0.6/downloader.py -f html http://www.fanfiction.net/s/5192986/ [02:50] bsmith093: pastebin all the output downloader.py gives you [02:50] says stuff like this for me "DEBUG:downloader.py(93):reading [] config file(s), if present" [02:54] Coderjoe: have you seen anyone using MHT stuff? or is it not that good compared to WARC? (as in this: https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/ ) [02:56] arrith: here http://pastebin.com/XhecfW5M [02:57] bsmith093: hmm that's pretty odd. at first glance it almost looks like fanfiction.net blocked you [02:58] i knew this would be more complicated than just a simple mirror [02:59] bsmith093: can you go to the url for the fanfiction in a browser or with wget or curl? [02:59] just thinking of that hold on [03:00] yes but it saves the page as index.html, and with all the iamges and other things [03:00] yeah [03:00] huh, but it lets you. interesting [03:01] but we could run the linkost throught wget and stripout the 404s right [03:01] but would they even be 404?, damn this is hard [03:02] a good site would have them as 404s. but yeah, wget has a good option for that: --spider [03:03] that's a good idea since wget i think would be much faster than this python script [03:03] bsmith093: btw when you said this earlier, were you talking about fanfiction.net? " even with the useragent workaround, stopped at 300Kfiles, and i know there are at least 2million stories" [03:04] oh yeah, but how do i gt that to make a list in a file of the storylinks, sorry for being this helpless, its just wget scares me with its many switches and arcane syntyax [03:04] yes i was [03:04] hold on ill lok in the bash history for the command [03:05] bsmith093: np, i like helping. just i hope at some future point you'll look back over the commands and try to understand them [03:06] wget-warc -mcpkKe robots=off -U="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.fanfiction.net [03:06] well, using them this much is helping :) [03:07] yeah. basically i know what i know due to lots of little jobs. one day i should sit down and read official documentations beyond the manpage but ehh.. not today [03:07] me too [03:07] bsmith093: ah, so you were trying to dl ff.net using wget-warc [03:07] might be why ff.net wouldn't like you that much :P [03:07] yeah, use a sledgehammer... [03:08] to drive a screw! [03:08] sure, after much cursing you'll get it in there [03:08] but is that really what you need to do? 
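For reference, the whole-site mirror command quoted above with its bundled flags spelled out (a sketch; whether ff.net tolerates a full mirror is a separate question):

```
#   -m  --mirror            recurse with timestamping
#   -c  --continue          resume partial downloads
#   -p  --page-requisites   grab the images/CSS needed to render pages
#   -k  --convert-links     rewrite links for local browsing
#   -K  --backup-converted  keep the unconverted originals too
#   -e robots=off           ignore robots.txt
#   -U ...                  spoof the Googlebot user agent
wget-warc -mcpkKe robots=off \
    -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
    www.fanfiction.net
```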
[03:08] chronomex: wher've you been all this time [03:08] when? [03:08] good to have you in the conversation :) [03:08] * chronomex waves [03:09] trying to dl fanfiction.net [03:09] so I see [03:09] any suggestions we're onto wget finally, but its tricky [03:10] dang, ff.net doesn't give a 404 for nonexistent stories [03:10] could we parse the page for story not found [03:10] yeah, gotta do that [03:10] or is it faster to grab anyway, then parse later [03:10] bsmith093: but that involves dling the page, which is more bandwidth ff.net has to suffer [03:11] wget has a random wait option [03:11] oh. all depends on what you want to do. could do it both ways. i tend to like grabbing then parsing later, but i figure diskspace is cheap [03:11] cant remember th e switch [03:11] --random-wait :) [03:11] hey will -spider tell us how big it is [03:12] random wait is more about not blocking wget than saving the host bandwidth [03:12] ah interesting thought [03:12] cause that would be good to know [03:12] doesn't seem to. just says "Length: unspecified [text/html]" [03:12] also on the archiveteam, see where tsp got to, this was apparently his baby [03:12] i'm just looking at the output of this btw: wget --spider http://www.fanfiction.net/s/9999999/ [03:13] sorry the archiveteam *wiki* [03:13] and [03:14] np, i got that. only mention i find says "Tsp is attempting to archive the stories from fanfiction.net and fictionpress." on http://archiveteam.org/index.php?title=Projects [03:14] plus that might not necessarily grab all the chapters, either. [03:15] been a while since I've seen Teaspoon around [03:16] bsmith093: what won't get all the chapters? wget? i figured we're just using wget (or curl) to check if the story exists, then feeding a list of stories into download.py [03:16] dang, would be nice if that guy had some writeup somewhere on his progress [03:18] oh yeah, the python script [03:18] i know, right? [03:19] bsmith093: i gotta go for a bit for dinner. i'll bbl. next step i see is to get curl or wget to go over the linklist and sort them into a known good (and maybe known bad) list. i'll help with that if you need it when i get back [03:20] ill look ove the wget man pages to see i fwe cant just sane eveeryhting as a uniwue name, and sort later [03:20] Coderjoe: you still here [03:20] ??? [03:21] chronomex: [03:21] alard: \ [03:21] what, hi [03:21] I'm tending a makerbot. [03:21] oohh , yay for you!] [03:21] so I'm here, just not watching irc [03:22] can curl parse html, for a certain string [03:22] no [03:22] ( i know nothing about it) [03:22] curl does not look at what it downloads for you [03:22] cause i have a massive linklist for ff.net and most of them probably dont exist [03:23] ffnet returns html with story not found if that id doesnt exist [03:23] sounds like you'll have to write a custom-ishspider [03:24] ugh [03:25] any ideas on code thats already done part of something like this? [03:26] alard: u seem to write most of the scripts archiveteam uses, any ideas o n saving ff.net [03:38] For some reason heritrix doesn't really listen to my parallelQueues = 15 setting. It's just running one queue [03:39] From what I remember of the presentation at IA, you can't spread the same domain into multiple queues [03:43] huh. [03:43] Also, you gamepro guys [03:43] Make sure you're getting the articles, there are a lot of interstitials [03:43] server-side interstitials? 
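A sketch of the parse-the-page check being discussed: fetch the story page and grep for the site's error text. It works, but it downloads the full page each time, and the exact "story not found" wording is an assumption that should be verified against a known-bad ID:

```
# Heavy-handed existence check (downloads the whole page):
if curl -s "http://www.fanfiction.net/s/0000001/1/" | grep -qi "story not found"; then
    echo "Not a story"
else
    echo "Story"
fi
```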
[03:44] any suggestions for things that might work for ffnet [03:44] The two I got were meta redirects [03:44] so, yeah [03:44] o ok [03:44] bsmith093: not off the top of my head [03:44] but it's definitely something I'm interested in [03:48] hey do invalid links have identical md5sum [03:48] doesnt really solve the bandwidth load issue, but it would help with weeding [03:48] bsmith093: sorry, I was trying to get some work done [03:49] np, we all have lives ;) [03:49] whoops wrong emote [03:55] back [03:55] hey [03:56] bsmith095: hey [03:56] so i was looking and i cant find anything to parse a webpage, which is odd [03:56] ah [03:56] checking for a thing existing or not shouldn't be hard. just grab the page then grep it [03:56] well there's a bunch of python libraries to do it 'properly' but i just use grep and exit codes [03:57] yeah but again the huge bandwidth issue [03:57] and i have absolutely no idea how u guys do it, but id love to paralellize this problem, how much space could 20 billion words possibly take up? [03:57] bsmith095: oh, ehh. well at least with wget i wasn't really able to find a way to get it to report page size [03:58] do they have same md5sum that ould help [03:59] yeah that's one way. but i'm pretty sure a grep would work fine. it wouldn't be futureproof but it'd get the job done at first [03:59] bsmith095: but wait, so did that one wget-warc download a decent amount of ff.net? since you said it got up to 300k or something? [03:59] yeah but it was beat t hell with hmtl, and css and ads and things, plus i needed the space so i dumped it [04:00] ah [04:00] bsmith095: i was just wondering what kind of error you ended up getting since that's when ff.net might've blocked you [04:00] btw about 20 billion words, assuming 5 characters per word and a following space: 20 billion * ((5 bytes) + (1 byte)) = 111.758709 gigabytes [04:00] holy christ, thats a lot! even for now, wehere u can buy terabytes like bread [04:01] heh [04:01] well compression goes a heck of a long way though. bzip or gzip should do a small fraction of that [04:01] btw thats the estimate of how big ff is [04:01] for text at least [04:01] ah [04:01] well i'd say you could get that down to maybe a few GB, probably less [04:01] with compression [04:03] can i compress than immediately dump the uncompressed, without completely killing my slighty overworked hardrive? [04:03] is there an app for that, cuz it would rock [04:05] bsmith095: that'd just be part of a script [04:06] like have the download.py grab the file, then compress it right after [04:06] man, i have *got* to learn scripting :D [04:06] it'd compress individually which won't save as much space, but you can recompress them all in batch after it's done [04:07] well i can put together a small thing in bash that'll get this job done. then you can learn python and make it all fancy :) [04:07] thatll work [04:07] one thing you gotta figure out is why download.py isn't working though [04:07] hey does amaozn ec2 have a trial i could completely kill one of their instances so my porr laptop doenst have to [04:09] good god ff.net are pricks. [04:09] Disallow: / [04:09] User-agent: ia_archiver [04:09] i figure i could dump the links into wget , have it name the files based on the id# then grep for story not found [04:09] use wget [04:10] one sec [04:10] Coderjoe: I'm not surprised, given the fanfic people I've known. [04:10] would amazon ec2 let me use them for this? [04:11] there's so much shady shit on ec2 [04:11] bsmith095: might be. 
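Circling back to the compress-then-delete question above, a sketch of wrapping each download so the uncompressed copy never sits around; where downloader.py lives and writes its output is an assumption, so each ID gets its own working directory, and goodlist.txt stands for whatever weeded ID list you end up with:

```
#!/bin/bash
# Download each good ID into its own directory, tar+gzip it immediately,
# then remove the uncompressed copy to keep disk usage low.
# Assumes downloader.py sits in the parent directory.
while read id; do
    mkdir -p "$id" && cd "$id" || continue
    if python ../downloader.py -f html "http://www.fanfiction.net/s/$id/1/"; then
        cd .. && tar -czf "$id.tar.gz" "$id" && rm -rf "$id"
    else
        cd .. && rm -rf "$id" && echo "$id" >> failed.txt
    fi
done < goodlist.txt
```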
but don't run it on anything you need for business incase it does get shut down [04:11] just pay your bill and don't run a botnet. [04:11] meaning what? [04:11] they'll let you do a lot... but you will probably wind up paying a bunch in bandwidth and instance [04:12] ugh, bandwidth again, its 2011, nearly 2012, i thought we were past this! [04:12] (my bill for nov is $255.62) [04:12] not in the server market [04:12] for what kinnd of usage [04:12] general rule is "sender pays" [04:13] can this thing be paralellized easliy [04:14] bsmith095: yeah [04:14] when we did the google video stuff we just put together chunks and people claimed the chunks [04:14] one person does, 1-20,000, another does 20,000-40,000, etc [04:14] er 20,001 [04:15] well in have that 300mb file i could pass around [04:15] yeah. well i think there's some kind of script already for delegating stuff that was mentioned earlier [04:15] i was gonna look into that [04:15] " integrate it with the scripts we used for splinder/mobileme/anyhub" [04:16] whatever those are [04:16] $97.99 in instance charges (one free micro, 245 hours of an m2.xlarge, 194 hours of an m1.large), $64.26 in s3 (stashed some grab stuff in there to get it off an instance. the storage was cheaper than the bandwidth out.), $93.37 in data transfer (873.951GB in for free, 15GB out for free, 778.049GB out for $0.120/gb) [04:16] therea a repo for them [04:17] i would much rather do parallelization with a full clean script and a tracker that hands out chunks of a few stories [04:17] whoo! thats cheap but not super cheap [04:17] and I was using spot instances for those two non-free instances [04:17] User-agent: ia_archiver [04:17] Disallow: / [04:17] I wish they would just disobey it [04:18] I mean [04:18] Archive teh site regardless [04:18] when is ff.net going down? [04:18] but if the robots.txt blocks it, just don't make it public [04:18] the story links are fanfiction.net/s/0000000 through 9999999 [04:18] Coderjoe: i don't think it is. i think this is just pre-emptive [04:18] Coderjoe: Pre-emptive afaik [04:18] Coderjoe: its not that i know fo, im being proactive [04:19] this is worse than geocities, mostly b/c the "creative, irriplaceable stuff" wuotient is much higher [04:19] quotient, can u tell im typing byt the light of my monitor [04:19] bsmith095: Why not iterate through every combination? [04:20] btw was it determined if the geocities effort got all of geocities or were some sites lost? [04:20] underscor: each story has chapters which are on separate pages [04:20] we could and i was going to, but thats 10 million links, most of which are non existent story wise, but which give back a page saying stroy not found [04:20] arrith: i think sites were lost [04:20] underscor: we've done that. we have a tool that checks each fanfiction id for chapters [04:20] and also that chapter thing [04:21] arrith: we do [04:21] ?? [04:21] bsmith095: oh yeah, sorry, i thought you knew kinda. the download.py takes just a normal link and grabs all available chapters [04:21] bsmith095: No need [04:21] one sec i'll pastebin [04:21] Just send a HEAD request [04:21] Only a few bytes [04:21] a what now? [04:21] curl -I http://www.fanfiction.net/s/7597723 [04:22] Just gets the headers [04:22] Tells you whether a story exists or not [04:22] Then you can go back later on [04:22] bsmith095: http://pastebin.com/kKpNxEBy [04:22] underscor: I HEART U [04:22] underscor: oh yeah that's what we're trying to do now. 
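Stepping back to the chunking idea mentioned above (one person does 1-20,000, another 20,001-40,000, and so on), a sketch of cutting the ID list into claimable pieces; the chunk size and naming are arbitrary:

```
# Split the 10-million-ID list into 20,000-ID chunks volunteers can claim.
split -d -a 3 -l 20000 file.txt chunk_
# Produces chunk_000, chunk_001, ... chunk_499.
```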
i was gonna wget then grep for "story not found", but hmm [04:22] but then you have to download the whole page [04:22] thats exactly what i was looking for!!! [04:22] -I is a lot better [04:22] Now, the interesting thing [04:22] underscor: is -I to chek for a 404? [04:23] is that it always returns 200 Ok [04:23] Nope [04:23] It just sends you the ehaders [04:23] But a valid story will have a header like [04:23] Cache-Control: public,max-age=1800 [04:23] Last-Modified: Fri, 02 Dec 2011 04:21:35 GMT [04:23] Invalid ones will have [04:23] Cache-Control: no-store [04:23] Expires: -1 [04:23] hey thanks want the linklist [04:23] ohh clever [04:23] bsmith095: Isn't it just 0-999999 [04:24] ? [04:24] see io knew the web would come up with something! [04:24] 0000000 actually [04:24] yes [04:24] i think [04:24] 7 digits [04:24] dunno if 0 works too [04:24] probably [04:24] yeah [04:24] ok [04:24] 10^7-1 [04:24] no 7 dig [04:24] 10 million [04:24] seq -w 0 9999999 [04:24] bam [04:24] we used this to gen a numberlist: echo {0000000..9999999 } | tr ' ' '\n' > file.txt [04:24] oh, that works too [04:25] ah yeah, i wasn't sure about seq. just the #bash people always say to use {x..n} over seq, forget why [04:25] because it's a builtin probably [04:25] I prefer seq though, personally [04:25] ah seq does newlines, nice [04:26] just for fun i'm "time"ing them [04:26] Yeah, that's one of the reasons [04:26] 190s probably [04:27] 0m8.544s for seq, my echo one is still running [04:27] just finished 0m39.947s [04:27] ure fast [04:27] seq is so the way to go heh [04:28] now we just have to get the damn downloader script to take id# as opposed to id# and title links [04:28] bsmith095: welll i think that's an ip issue, not the script necessarily [04:28] * underscor quietly works on his own version in bash [04:28] since it works fine for me [04:28] underscor: haha [04:28] lemme pastebin the snippets i have so far [04:28] I actually started this back in like March [04:29] haha [04:29] ohh [04:29] good to hear [04:30] pass me a valid link [04:30] actually the stuff i have is just weird stuff using grep and a thing to generate a ~350MB file of linklists [04:30] just to test [04:30] http://www.fanfiction.net/s/7597723 [04:30] anyone have a valid story link [04:31] or http://www.fanfiction.net/s/5192986/ [04:32] underscor: is your stuff in a single bash script? and are you using fanficdownloader-4.0.6 ? [04:32] output http://pastebin.com/5Q09g7xB [04:32] My stuff is a single bash script [04:32] and no, I didn't know it existed [04:32] clearly not enterprisey enough [04:33] pastebit it please? [04:33] underscor: for distributed efforts, I would prefer something like python over bash. bash relies on other processes on the system and as a result has too many variations [04:33] yes, bash scripts are pretty fragile [04:33] Absolutely, I agree [04:34] really? yours have been pretty robust [04:34] I'm not very comfortable with python though, so I'm just dicking around in bash atm [04:34] it takes work to make them robust. [04:34] (farming out to a wget process is ok) [04:34] ah [04:34] * chronomex currently scraping several million pages with ruby [04:34] learning python is something. but eh, you can keep bash pretty portable [04:35] chronomex: ruby? 
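A sketch of the header-based check described above: a HEAD request costs only the headers, and per the discussion valid stories carry Last-Modified while missing ones come back with Cache-Control: no-store (the status is 200 either way, so the header is what you test):

```
# HEAD-only existence check for a single story ID.
check_id() {
    curl -s -I "http://www.fanfiction.net/s/$1/" | grep -q '^Last-Modified:'
}

check_id 7597723 && echo "Story" || echo "Not a story"
```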
[04:35] bsmith095: yes, I've been using ruby lately [04:35] Ruby has a lovely http library [04:35] weeding the linklist [04:35] underscor: getting the ff.net effort as part of the scripts for mobile me and stuff to distribute the effort i think would be good [04:35] typhoeus or something [04:35] arrith: I agree [04:36] However, I am probably not your man [04:36] typhoes ?? whst [04:36] mostly due to time [04:36] underscor: is your bash stuff at a state you can show people? [04:36] underscor: I use Mechanize and Hpricot. [04:36] arrith: Not atm [04:36] I'll work on fixing it up [04:36] chronomex: https://github.com/dbalatero/typhoeus [04:37] arrith: not really. dld-streamer.sh (and my chunky.sh from friendster) relied on associative arrays. CentOS has too old of a bash. freebsd and osx have BSD userland, while unix people typically have gnu userland. and there have been bugs between different version of tools like expr. [04:37] underscor: hmmm, neat. I'm scraping single sites, though, so I don't have much use for 1000 threads :P [04:37] underscor: hmm alright. i'm not sure what you've done so i dunno if the current methods bsmith095 and i are using are the best ones [04:37] I tend to do things the "fuck the building is burning down, just get some shit written" way [04:37] so anything y'all do is probably cleaner [04:37] ^ [04:38] My bash scripts are basically the exact opposite of alard's [04:38] I do things the underscor way, unless I'm releasing it into the wild. [04:38] i'm doing "i barely know how to string this stuff together but it seems to work so w/e" [04:38] Extremely non portable, and no idiot checks [04:38] yeah, alard would know, somebody wake him/(her?) up [04:38] him [04:38] ah [04:38] He's in NL (I think?) so it's early there (?) [04:39] where are all the female geeks [04:39] Coderjoe: ah expr bugs don't sound fun, and yeah you have to avoid a lot to make a portable script. just egh, bash just seems easier to me than python [04:39] bsmith095: asleep [04:39] NL, wheers that [04:39] arrith: bash is easier to get into but harder to make work well. [04:39] python is not that hard, and you have a HUGE standard library to rely on [04:39] terrible with geography [04:39] bsmith095: the netherlands? that's in europe, silly [04:39] ah yeah the web is global, i forgot [04:40] .... [04:40] i once got into an argument with my dad that u cant go east to get to russia, im like no wait thats the other side of ... oh wow duh :D [04:41] been looking at maps too long need a glboe [04:41] globe [04:41] yeah dang, i gotta learn python. and go through all the grueling hours of relearning how to do some simple thing [04:41] er, the problem was in grep [04:41] https://github.com/ArchiveTeam/friendster-scrape/commit/b1f5b72cd13e20d6b02c20d8fc7b2710fc816a61 [04:41] arrith: sorry for the tedium [04:41] with bash it's like you can copy and paste around a bunch, python feels like you can't just piece stuff together [04:42] Coderjoe: oh dang, grep -o. i've run into so many "-o" bugs it's not even funny [04:42] For example [04:42] This is what I'm using to test IDs [04:42] i just avoid it and do weird sed mangling [04:42] It's dirty as hell [04:42] var=`curl -s -I http://www.fanfiction.net/s/5983988|grep Last`;if [ -z $var ]; then echo "Not a story";else echo "Story";fi [04:43] underscor: i was thinking about asking you how you did that. 
just now i was diffing the output of various curl -Is [04:43] xow print that list to a file and were golden' [04:43] I use grep -oP all the time [04:43] it's rad [04:43] wth is the z switch [04:43] bsmith095: null-terminated [04:43] oh, in [ [04:43] bsmith095: 'man test' [04:43] It's "if it is set" [04:44] yeah if it exists [04:44] -z is string is empty [04:44] underscor: i tend to go off of grep's exit code [04:44] grep thing; if [ $? -eq 0 ]; then; stuff; fi [04:45] exit codes seem 'faster' to me [04:45] underscor: what happens if the story has the word "Last" in it? [04:45] Doesn't matter [04:45] curl -I gets headers only [04:46] oh. you're checking Last-modified [04:46] Yeah [04:46] invalid stories don't have ti [04:46] s/ti/it/ [04:46] why can't people just use frigging HTML status codes. this is EXACTLY what 404 is for, and you can still have your own custom 404 page [04:47] s/HTML/HTTP/ [04:47] yes, I meant http [04:48] yeah for stories that don't exist they don't have Last-Modified, and they also have "Cache-Control: no-store" and "Expires: -1" [04:48] that's fucking retarded. [04:48] Coderjoe: seriously. ff.net not using 404s is so annoying right now [04:49] I wonder how ff.net feels about 10 million HEAD requests [04:49] lol [04:50] they should've used 404s :P [04:50] better than grabbing the full page like i was gonna do.. [04:50] Well, they'd be HEADs regardless [04:50] At least we don't have to grab the full page [04:50] oh, right [04:51] i guess i assumed whatever wget --spider does is as lightweight as it can get. i actually don't know what's in what it does [04:51] some kind of HEAD [04:52] what's the best thing like piratepad to use these days in terms of doesn't time out? [04:52] typewith.me ? [04:52] oh hmm wait actually, there's one for code [04:52] i forget its name. it's new [04:53] arrith: splinder wasn't using a status code to say "hey, we're temporarily down for maintenance". instead they redirected to /splinder_noconn.html which was a 200 [04:53] Coderjoe: ahh wow [04:54] ahh i was thinking of stypi but it doesn't have bash/sh support [04:54] ;/ [04:54] i wonder what they do support is the closest to bash [04:59] \o/ Progress [04:59] http://pastebin.com/ReqNs8TF [05:00] underscor: looks good [05:02] i hate wireless when im at the fringes, what i miss? [05:03] bsmith093: one sec [05:06] bsmith093: http://pastebin.com/1QN2tagB [05:07] oh [05:07] bsmith093: also http://badcheese.com/~steve/atlogs [05:08] forgot this channel had that [05:13] ok now pass stpryinator the valid ids and itl gram them all [05:14] bsmith093: stpryinator? [05:14] storyinator the output of the laste pastebin link [05:15] do we have a weeded list yet? [05:16] bsmith093: is storyinator something? searching the backlog doesn't show anything [05:16] the download.py thing? [05:16] check the logs the last pastebin [05:17] ohh [05:17] bsmith093: storyinator is i guess the name of what underscor is working on. 
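Extending the single-ID check sketched earlier into a full weeding pass, in the exit-code style just mentioned; this mirrors the idea of the pastebinned snippets rather than copying them, and the goodlist/badlist filenames follow the discussion below:

```
#!/bin/bash
# Classify every ID by HEAD request; grep -q's exit status decides the list.
while read num; do
    if curl -s -I "http://www.fanfiction.net/s/$num/" | grep -q '^Last-Modified:'; then
        echo "$num" >> goodlist.txt
    else
        echo "$num" >> badlist.txt
    fi
done < file.txt
```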
it's not done yet [05:17] http://pastebin.com/ReqNs8TF [05:19] bsmith093: yeah that, it's not done yet [05:19] but here: http://paste.pocoo.org/show/515656/ [05:20] that's not to be run just as a script, but each piece kinda ran individually [05:21] bsmith093: should generate a list of good and bad IDs (lines 21-29) then you just feed the list of good IDs into the fanficdownloader (lines 32-34) [05:21] assumes you already have a nums.txt [05:23] yay u rock [05:24] see i knew this would'nt get done unless i nagged the community to get to it, and save already [05:25] 200tb wont back itself up, but this is nothing compared to mobilme, in terms opf volume anyway [05:27] thiss will probably take all night [05:32] good news is i can just dump these good id# links back into the original fanfic downloader im personally using [05:37] wheeeee [05:37] Frontpage Gotten [05:37] Let's get some metadata. [05:37] Running storyinator on id 5983988 [05:37] Title is A Different Beginning for the Cherry Blossom [05:37] Writen by Soulless Light, whose userid is 1807842 [05:37] Placed in anime>>Naruto [05:38] can i get that script [05:40] apporx 7 hrs till the id sorter is done sorting [05:41] bsmith093: Doesn't actually download anything yet [05:41] try this, fanfictiondownloader.net [05:41] throw the good story ids in there [05:42] fanfictiondownloader.com [05:42] nevermind it is .net [05:43] www.fanfictiondownloader.net [05:43] oh one thing [05:43] what [05:43] bsmith093, underscor: the download.py thing only gets the stories i think [05:43] oh okay [05:43] but on ff.net there's like author commentary, history of stuff getting posted [05:43] ummm yeah thats the idea [05:43] I'm getting reviews and a bunch of stuff [05:43] yeah [05:44] there's a lot more to the site than the stories [05:44] so i'd want to include those in a proper archival process [05:44] underscor: good to hear [05:44] well ok then you get that ill get the stories [05:44] bsmith093: heh well a backup of just the stories is still good to have [05:44] then in a scramble there's a lot less to dl [05:44] how will we re run this to update the archive [05:44] just futureproofing here [05:44] yeah that's something, periodic rerunning [05:45] merge the deltas [05:45] i didn't really put stuff into that script for that above hmm [05:45] see this is why we only bother to archive closing sites, they dont change a smuch [05:46] well, just figure out what the current latest ID is [05:46] umm how exaclty [05:47] 7601310 is the current latest [05:47] ure head thing done yet? [05:47] http://m.fanfiction.net/j/ [05:47] as of when [05:47] 10 seconds ago [05:47] It's whatever's on the top of that page [05:48] see this is what im talking about, we'll always be behin dthis site [05:48] underscor: and every number up to that is known used? 
[05:48] lots of skipped [05:48] hm [05:48] No [05:49] But it's easy to have something check that page every 5 minutes [05:49] well, if you use wget-warc, with a cdx file, you can have it save the updated page [05:49] arrith: run ure own script youll see there are a lot of holes in the seq [05:49] bsmith093: ah [05:50] you'd need to check that page of new stuff pretty rapidly to make sure you don't miss anything [05:50] not really [05:50] Yeah [05:50] once every 10 minutes is probably sufficient [05:51] or less often [05:51] since the only other way i can think of to get new chapters is to go through story IDs to check for any new ones, then check all working story IDs for new chapters [05:51] again, not to beat a dead horse, but this is a huge, popular, currently active website [05:51] yeah [05:51] you just have one worker that checks it periodically, checks between the last-known-max and the latest on that page, and notifies the tracker [05:51] oh [05:51] and thats another thing can somebody please code up a tracker? [05:52] right if it's always sequential. for a second i thought it was random, nvm [05:52] so isit seq [05:52] the holes are from deleted stories [05:52] does that updating page include reviews/author comments/user comments/etc though? looks to me like it's just new stories [05:53] thats a lot of deletions any hope of recovery? [05:53] ia waybac maybe? [05:53] ia doesn't archive it [05:53] arrith: nope [05:53] they block IA, remember [05:54] WTH not?! [05:54] they block IA, remember [05:54] so use googlebot, well its too late now, but still! [05:54] all the more reason to have an ongoing mirror [05:54] http://www.fanfiction.net/robots.txt [05:54] which afaik means periodic respidering [05:54] for new comments, etc [05:55] yep [05:55] and periodic dumps, like for wikipedia, but actually GOOD [05:55] yeah i recently saw someone talking about looking for directions on how to setup a wikipedia dump and was having a bit of trouble. i dunno how easy it is but it didn't sound fun to me [05:56] http://b.fanfiction.net/atom/j/0/2/0/ [05:56] "updated stories" in a nice rss feed [05:56] see what a kick in the pants can do for productivity, none of this,(afaik) was happening 6 hrs ago [05:56] Coderjoe: ah nice and structured. gj [05:57] ffnet is "fully automate" [05:57] bsmith093: ehh underscor was doing some stuff i think technically [05:57] thats why i said afaik [05:57] and here's new stories: http://b.fanfiction.net/atom/j/0/0/0/ [05:58] yay, an atom feed we can scrape that! [05:58] and wherever that one guy left off. if he left a record [05:58] he didnt [05:58] underscor: did you track how far Teaspoon / tsp got on ff.net? [05:59] Nope, sorry [05:59] SketchCow: if ura still up, any thought/input/ constructive critisisms [05:59] reviews are under /r/ instead of /s/ [05:59] http://www.fanfiction.net/r/7573167/ [05:59] thats useful go, automation! [06:00] does the r match the s for the same story [06:00] underscor: np. eh well, probably not that far [06:00] same # [06:00] yes [06:00] yes [06:00] whhoohoo ! so easy then [06:01] there are also communities and forums that need archiving [06:01] are those braindead simple url too [06:01] communities are not [06:02] nor are forums [06:02] oy well cant have everything [06:02] oh wait, I can, GO ARCHIVETEAM! [06:02] Coderjoe: you wouldn't happen to have found an atom/rss feed for reviews have you? 
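The atom feeds just linked make the periodic re-check much lighter than re-spidering; a sketch of pulling story IDs out of the updated-stories feed and queueing both the story and its review page (reviews share the same number under /r/, per the discussion; the assumption is that the feed body contains plain /s/<id> links):

```
#!/bin/bash
# Extract story IDs from the "updated stories" feed and queue re-checks.
curl -s "http://b.fanfiction.net/atom/j/0/2/0/" \
    | grep -o 'fanfiction\.net/s/[0-9]*' \
    | grep -o '[0-9]*$' \
    | sort -u \
    | while read id; do
        echo "http://www.fanfiction.net/s/$id/1/" >> recheck_stories.txt
        echo "http://www.fanfiction.net/r/$id/"   >> recheck_reviews.txt
      done
```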
[06:03] since the more structured the less darkarts html parsing [06:04] they might be on that update feed [06:04] (the first one, /atom/j/0/2/0) [06:04] best part about this script is its much lighter pon my cpu and disk io [06:04] it will list the story, and I think a new review might cause it to go on that [06:05] I know it does chapters [06:05] just passed 0006000 [06:06] nope. this story has one review posted a couple weeks ago [06:06] but is listed on the update feed with an update date of today. I think they posted a new chapter [06:06] see if people wouldn't write so much we'd have less work :) [06:09] its 10927est so off to bed for me, not leaving thought will be asleep [06:11] quick thought here it would be great if once we have everything eatch out for ompleted sotries and pull them from the scrape queue [06:11] completed stories and pull them out [06:12] gnight [06:13] still want to scrape for reviews. and there is nothing that says the author can't revise something [06:14] yeahh i was thinking author edits [06:14] and author comments [06:14] i don't have experience with this kind of stuff but i'm hoping the header last modified is accurate in this case [06:15] bsmith093: you're checking for existing stories? [06:15] since if fanfiction.net is blocking him ( bsmith093 ) then he'd just get a big list of false negatives ;/ [06:16] Do new comments change the "update date"? [06:16] no [06:16] ok [06:16] at least I don't think so [06:17] that's something I haven't specifically checked. I did find that the stories I looked at had a newer update date than the last review [06:20] underscor: for ease of notation, you can have the 999 stuff like this: seq -w 0 $((10**7 - 1)) [06:20] or $[10**7 - 1] [06:20] bleh [06:20] hard for me at least to visually see how many 9s there are [06:20] seq was another compatability issue [06:20] i have 87 episodes of crankygeeks [06:21] Coderjoe: ohh yeah. trying to shoehorn seq stuff into jot on osx [06:21] i'm just getting ipod format since its only 105mb after episode 70 [06:22] mostly so i can fit 40 episodes onto a dvd [06:23] At that size, don't you mean approx 47? [06:23] 4.3 for gibibytes [06:23] about [06:23] more like 43 [06:23] Ah. [06:23] I have a 4.7 GB DVD-R here. [06:23] some are 115mb [06:23] 4.7 uses the base 10 'gigabyte' harddrive mfw scammers use [06:24] Also, why is this not changing nick to NotGL- [06:24] 4.7 gigabyte to gibibyte= 4.3772161006927490234375 gibibytes [06:24] Ah [06:25] underscor: you gotta have a github repo called "DON'T LOOK HERE" then just secretly push stuff like you ff.net work ;P [06:26] <*status> | STR_IDENT | 1 | Yes | irc.underworld.no | NotGLaDOS!STR_IDENT@ip188-241-117-24.cluj.ro.asciicharismatic.org | 3 | [06:26] I don't get it. [06:27] ...wait, is my nick NotGLaDOS? [06:27] it is [06:27] yes [06:27] ...damn quassel playing tricks on me [06:28] And fixed, with hackery. [06:28] And fixed, with hackery. [06:28] try /nick YourNewNick [06:28] I had to force a module to send a false IRC command on the in direction, because it was displaying my nick as STR_IDENT [06:29] Wheee [06:29] Even farther! [06:29] http://pastebin.com/MWsp8Fv3 [06:29] underscor: progress? [06:29] arrith: Yep :) [06:29] ahh pretty nice [06:30] underscor: at this point are you echoing extracted data? [06:30] or is there stuff it's doing on the bg that isn't echoed? [06:30] story ID number in xml? 
[06:31] (yes, you have the directory, but the number in the xml can help make sure that it can be correlated if separated and/or renamed) [06:33] arrith: Everything its doing is echoed [06:33] Coderjoe: Good idea, added [06:33] it's* [06:36] is there arepo yet> [06:36] underscor: what do you use to deal with xml in bash? [06:36] bsmith093: kind of underscor's pet project at this point i think [06:36] yeah [06:36] xml handling is done in php [06:36] bsmith093: also, you might want to doublecheck that you're getting some good nums and not all bad [06:37] ohh [06:37] underscor: cheater! :P [06:37] $f = create_function('$f,$c,$a',' [06:37] $xml = new SimpleXMLElement("<{$root_element_name}>"); [06:37] foreach($a as $k=>$v) { [06:37] function assocArrayToXML($root_element_name,$ar) [06:37] { [06:37] if(is_numeric($k)) [06:37] $k="v".$k; [06:37] if(is_array($v)) { [06:37] $ch=$c->addChild($k); [06:37] $f($f,$ch,$v); [06:37] } else { [06:37] $c->addChild($k,$v); [06:37] } [06:37] }'); [06:37] $f($f,$xml,$ar); [06:37] return $xml->asXML(); [06:37] } [06:37] yeah i think im getting false ngs [06:37] bsmith093: are you getting any positives? [06:38] random check [06:38] arrith: as in any in goodlist.txt? [06:38] 0005543 [06:38] underscor: ah, not too complicated [06:39] underscor: could probably rewrite that in bash.. [06:39] in goodlist but not there in firefox [06:39] WHAT THE HELLO HI [06:39] check please [06:39] OK, my internet is back. [06:39] hey hey hey its SkeeetchCow [06:39] Jesus, lot of backlog. [06:39] I know. [06:39] I looked a the channel and thought "Fuck it." [06:40] fanfiction.net/s/0005543 [06:40] I can't do that. [06:40] So bsmith093 is trying to save fanfiction? [06:40] bsmith093: yep dang. that is a false positive [06:40] well me and underscor [06:40] so its not just me [06:40] SketchCow: bsmith093 really wants to, underscor is doing a lot of stuff, i'm poking around with parts of it and Coderjoe is helping [06:40] So my question is, what's going on? [06:40] It's shutting down? [06:40] No [06:40] SketchCow: no, all pre-emptive [06:40] ffnet seq id check [06:41] Premptive [06:41] OK. [06:41] So remember, if it's pre-emptive, don't rape it. [06:41] if it was id make sure google knew [06:41] SketchCow: ofc [06:41] That's all I can really contribute. [06:41] Looks javascript free. [06:41] I like raping sites though :( [06:41] SketchCow: they have a pretty light blocking trigger finger i think. at least they blocked bsmith093 somehow for some reason i think [06:41] Have we considered just pinging them? [06:41] Going HEY WE WANT A COPY [06:41] Or no [06:41] and it has atome feeds for everything ! whooA! [06:41] not sure if anyone thought of trying that [06:41] They block IA [06:42] so they're a hostile target [06:42] [06:42] i keep saying use googlebot [06:42] yeah, i think based on blocking IA people figured to not try [06:42] bsmith093: No, I mean [06:42] They block the wayback machine [06:42] It doesn't actually stop us [06:42] so switch the useragent [06:42] ??? [06:42] one guy, Teaspoon / tsp i guess worked on this a little while ago but he hasn't been seen and people haven't seen how far he got [06:43] bsmith093: What do you mean? Wayback Machine isn't going to switch its useragent... 
[06:43] They obey robots.txt for legal reasons [06:43] bsmith093: they haven't blocked underscor's efforts afaik [06:43] I'm up to 71k [06:43] bsmith093: so he doesn't need to change his UA, at least not yet [06:43] underscor: ohhh, ok then that makes much more sense [06:43] (just checking IDs [06:43] ) [06:44] i suppose that so they can say well we didnt save u cause u blocked us [06:44] underscor: are you still doing that check for "Last" in the header? [06:44] Yeah [06:44] underscor: since i think bsmith093 just found a false positive, he checked [06:44] here [06:44] underscor: http://www.fanfiction.net/s/0005543/ [06:45] still dead for me [06:45] oh wait [06:45] i might've mixed up dead and not dead [06:45] Not a story [06:45] var=`curl -s -I http://www.fanfiction.net/s/0005543/|grep Last`;if [ -z $var ]; then echo "Not a story";else echo "Story";fi [06:45] bsmith093: goodlist is badlist and badlist is goodlist [06:45] Doesn't trip up mine [06:45] ure kidding me!? [06:45] lol [06:45] bsmith093: just rename them when it's done :P [06:45] checing bad list then [06:45] i blame the error codes [06:46] 0009863 [06:47] its good u did reverse the polarity [06:47] yay tom baker [06:47] bsmith093: yeah looks like a story [06:47] heh, yeahh [06:47] i did say i didn't check the script :P [06:47] stop and switch or keepgoing [06:47] 77000 now [06:48] underscor: what are those numbers u keep giving [06:48] id's I've checked up to [06:48] OK, this needs to go to another channel. [06:48] for story/not story [06:48] :( [06:48] #fanboys or #fanfriction [06:48] HOW R U THAT FAST? [06:48] second one [06:48] K THEN [06:48] caps [06:49] bsmith093: keep going [06:49] k [06:49] SketchCow: you are good with those names [06:49] remind me i 7hrs hen its done [07:05] reading the logs in the topic brought up a question that i didn't see answered: does archive.org archive porn? [07:06] or IA rather [07:08] ^ [07:09] SketchCow [07:09] SketchCow: does the Internet Archive archive porn? [07:09] lol [07:23] http://www.archive.org/details/70sNunsploitationClipsNunsBehavingBadlyInBizarreFetishFilms [07:23] ugh [07:24] you had to link to the one I had seen before [08:04] Why do people ask that [08:04] Since Porn simply means "Material considered sexually or morally questionable by random community standards", of course it does. [08:05] So does google and so does facebook [08:05] good [08:06] SketchCow: what irc bouncer do you use? znc or irssi maybe? [08:13] Irssi [08:14] Pumped through a screen session [08:15] yeahh. i gotta learn irssi. [08:27] It's not so bad. [08:27] I use a screen session that puts the channel list along the right, like mIRC used to. [08:27] Also, I put this on the machine that runs textfiles.com and a bunch of services. [08:27] So I know INSTANTLY if something's wrong with the machine. [08:30] well you can't really right click on things and do other gui-ish stuff that you can in xchat. i can totally see me using irssi for logging but for everyday stuff i'm not sure yet [08:35] irssi supremacy [08:41] You can if you have an ssh client that makes URLs alive. [08:41] And people? Fuck people [08:41] They're all the same [08:42] who needs to right click on them [08:45] SketchCow: are you shooting all video on dslrs? [08:46] Yes. [08:46] that became a thing pretty quickly [08:51] apparently it doesn't suck much at all. [08:51] sensor is sensor, and dslrs often have nice sensor. [08:52] and excellent glass [08:52] and more variety [08:52] I don't think anything about it sucks [09:08] Some things suck. 
[09:08] But they're quite doable for what they are. [09:09] I have to also point out that I was trained, at 20, to be able to unload, canister, and then reload and thread 16mm film into a set of reels, all while inside a leather bag so they wouldn't be exposed to light. [09:09] Comparitively, this new material is even better than what that was giving me. [09:10] what are the things that suck? [09:10] I know very little about video production [09:12] http://www.youtube.com/watch?v=mEdBId3OuuY [09:16] no gui at the moment [09:42] arrith: I'm using irssi in a screen session, and I'm saying you're wrong. I'm clickin' links like a darned Mechanical Turk on speed [09:43] Most terminals convert text that it things is a link, to a clickable element. PuTTY (Win/*nix), gnome-terminal, rxvt, xterm, iTerm and co all do [10:16] were passwords ever reset on the wiki? [10:16] i haven't logged in for a while and i'm having trouble. my password autocompletes but the login doesn't work [10:22] Maybe [10:22] I cleared out one-offs who did nothig [10:24] SketchCow: could you look into User:Arrith really quick? [10:24] just reset your password [10:24] dnova: i was looking for a page for that [10:24] didn't find one [10:25] should be linked on the login page [10:26] ersi: you sure it's there? might just be me but i'm not seeing anything. just "Create an account." [10:29] hmm [10:30] huh, weird. alright.. there's no special page for that [10:30] really? [10:30] well the signup doesn't have an email, usually resetting a password involves sending an email out [10:32] welp. [11:00] how did the crawls from yesterday turn out? [11:11] well until further notice i am now arrith1 on the wiki [11:17] what is python used for in the splinder download process? [12:15] testing script to download all wikkii.com wikis [12:15] WIKIFARMS ARE NOT TRUSTWORTHY. [12:17] k [12:17] indeed [12:19] btw if any Administrators get a chance, could one of them merge User:Arrith with User:Arrith1 please? [12:19] oh wait, seems only SketchCow can do that [12:20] i guess no [12:20] what nick do you want? [12:20] i mean, sysop can merge pages [12:21] user accounts i think it is impossible [12:22] if someone has a spare moment, could they put "Uploaded; still downloading more" next to my name in the splinder status table on the wiki? [12:22] emijrp: http://www.mediawiki.org/wiki/Extension:User_Merge_and_Delete says "merge (refer contributions, texts, watchlists) of a first account A to a second account B" [12:22] mm, looks like possible mergin accounts [12:22] which would be good enough for me [12:22] but are extensions, which need to be installed separately [12:23] not sure if jason is going to install it only for you [12:23] : ))) [12:23] emijrp: http://archiveteam.org/index.php?title=Special:Version says it's already installed :P [12:24] although i am still curious why i can't get in on my original acct [12:24] ah man, he used it to merge spam users, i forgot [12:24] dnova: done [12:24] thanks [12:25] yeah on that topic, there are a lot of pages with {{delete}} [12:25] and by a lot i mean more than 5 heh [12:27] we still need splinder downloaders [12:28] why my userpage on AT wiki has 8000 pageviews? 
[12:28] emijrp: it's a very nice page [12:28] btw, the time stamp on the latest archiveteam.org wiki dump at http://www.archiveteam.org/dumps/ is 15-Mar-2009 :| [12:28] hardly "weekly" [12:31] http://code.google.com/p/wikiteam/downloads/list?can=2&q=archiveteam [12:31] i do weekly, but i dont upload them [12:31] aha, didn't think of wikiteam but now that i think about it that makes sense [12:31] emijrp: images though? [12:31] i dont upload images to googlecode [12:32] only 4gb of hosting [12:32] http://www.archive.org/search.php?query=wikiteam%20archiveteam [12:32] hm. i'm not sure where a good host would be. my first thought is archive.org [12:33] hmm that's good but, are those maintained? seems to be from August and July [12:33] http://www.referata.com/ another wikifarm unstable [12:33] what is a wikifarm [12:34] free hosting for wikis [12:34] oh. [12:36] referata is for semantic wikis [12:36] cool stuff [14:06] downloading it:pornoromantica [14:16] wtf is that [14:19] I sure do not know [14:45] SketchCow have you seen this filter for eliminating aliasing in 5Dmk2 video? http://www.mosaicengineering.com/products/vaf-5d2.html [15:02] rude___: That's rad! [15:06] yup, funny that I've never noticed aliasing that bad in 5Dmk2 video before seeing the demos [15:06] http://www.m0ar.org/6346 [15:06] This is amazing [15:06] (it's not actually porn) [15:11] pornoromantica is up to 29mb. [15:16] wtf is that [15:17] someone's splinder account [15:17] up to 32mb now [15:20] underscor: how long do I have to watch before it gets amazing [15:21] like 2 mins [15:21] right after she says "I think he wants to fuck me" [15:21] ah, there [15:21] and... ? [15:23] emijrp: ? [15:24] she says that and what happens? [15:24] oh. I have no idea. [15:24] I'm an archivist. [15:24] I'm confused. [15:24] he archives the shit out of her [15:24] WHAT. [15:27] emijrp, download some splinder [15:27] these last bunch of profiles are a real bitch [15:27] If I download splinder, who the hell is going to download wikis? [15:27] not really [15:28] you have to watch it! [15:28] I don't want to spoil it [15:35] i downloaded ~14 gb splinder, some might be unfinished. cant continue. where to put it? [15:36] ask SketchCow for a slot and then use the upload-dld.sh script [15:36] upload-finished.sh rather [15:44] ok [15:44] SketchCow: i need a place to upload splinder downloads [17:38] You got it. [17:40] Has anyone else in here gotten calls/contact from reporters wanting to do an article on archive team? [17:41] nope [17:41] me either [17:44] A fairly terrible article is going to come out, and I apologize in advance for it. [17:44] how did you find out about it [17:44] I did an interview for it. [17:45] I didn't realize who was writing it, he used an intermediary who did not identify him as the author, after I repeatedly refused to interact with him. [17:45] Now I found out and I have been yelling. [17:45] I'm good at yelling. [17:45] lol [17:46] oh, was it Talmudge and/or Schwartz [17:46] they're in your voicemail, archiving your archiving [17:48] Yes [17:49] yipdw: You know them? [17:49] sneaky bastards [17:49] not personally [17:49] I do remember seeing Mattattattattattattattattahias Schwartz's article on Internet trolling a while back, though [17:50] hahaha [17:52] I figure if he wants to invoke the pumping lemma on his name, it's fair game [17:52] * closure listens to a 1 gb WD green sata drive fail to spin up in my external dock [17:53] :( [17:53] wonder if it will do better on internal SATA.. 
will have to try later [17:54] huh, on one dock it does nothing, on the other I can hear the motor failing to quite spin it up [17:55] Or you're torturing it and it's randomly going up and down. [17:56] us? torture hard drives? [17:56] never! [17:57] Schbirid: Need the slot! [18:01] oh good, everything on this drive is still present on some 50 or so dvds. urk. [18:01] haha [18:29] god damnit I just lost about 15 hours worth of downloading. [18:29] See, that's what you get [18:29] "ha ha you lost so much stOH FUCK I LOST MY STUFF" [18:29] Jesus did it to you [18:30] Jesus, he likes insta-parables these days [18:30] I wasn't laughing at closure!! well I kinda was. [18:31] argh! [18:32] Jesus knew [18:35] hahaha [18:39] haha closure and dnova lost stuff [18:40] i lost 1.5TB some months ago, so, it cant get worse [18:42] obviously, currently unique stuff is being destroyed on the Internet, what is your estimate in Megabytes? [18:43] mb/hour [18:46] Thank you for placing your order with the Comprehensive Large-Array data Stewardship System. [18:52] what's that? [18:57] I sure do not know [18:58] haha [18:58] I lost pornoromantico :( [18:58] it was over 400mb [18:58] It's the Big Brother program for fat people [19:00] hahahah [19:00] It's NOAA's tape access system [19:00] you want noaa tapes? [19:00] I need a piece of historical data for oceanography class [19:01] awesome. [19:07] Get your piece of oceanographic data http://en.wikipedia.org/wiki/Exploding_whale [19:12] I'm digging through noaa's various public FTP servers [19:12] There's so much old cruft and stuff, it's really cool [19:12] TODO: Ring bob and tell him to actually upload the data here [19:12] This directory contains files related to the March 1993 Blizzard. It [19:12] includes a report on the storm and related data files described in [19:12] the report. NCDC's homepage provides easy access to this directory. [19:13] That file was updated 8/15/97 [19:13] and the data is still not there [19:13] hahaha [19:13] damnit, bob [19:15] damn, only 5 of 14gb were "finished" data [19:15] noaa has public ftp archives ?! [19:15] ftp.ncdc.noaa.gov, ftp.ngdc.noaa.gov [19:16] ftp.nodc.noaa.gov [19:18] If that FTP has been up since 1997, it is almost as trustworthy as Internet Archive. [19:25] anyone want to catch me up on how we're archiving ffnet [19:25] im also in #fanfriction [19:29] looks like podtrac doesn't keep the ipod format of crankygeeks after episode number 100 [19:29] likely the mpeg4 is hosted by pcmag [19:30] are u grabbing all their feeds, ausio ogg, etc [19:30] *audio [19:30] no [19:30] mp3 is down [19:31] can you guys please start mirroring crankygeeks [19:31] i didn't think it was this bad yet [19:31] when i checked only 4 mp3s were dead [19:32] its only 100-103 that are down [19:32] ok [20:09] Anyone have an idea for a sed that keeps nothing but A-Za-z0-9? [20:09] What's a sed? [20:10] Found it.
sed 's/[^a-zA-Z0-9]//g' [20:12] looks like all links to 101-103 are dead [20:12] for crankygeeks [20:13] SketchCow: alternatively grep -Eo '[a-zA-Z0-9]' might do what you want [20:23] I would move ./Amiga Dream 01 - Nov 1993 - Page 32.jpg to AmigaDream01-Nov1993-Page32.jpg [20:23] I would move ./Amiga Dream 01 - Nov 1993 - Page 67.jpg to AmigaDream01-Nov1993-Page67.jpg [20:23] I would move ./Amiga Dream 01 - Nov 1993 - Page 22.jpg to AmigaDream01-Nov1993-Page22.jpg [20:23] I would move ./Amiga Dream 01 - Nov 1993 - Page 15.jpg to AmigaDream01-Nov1993-Page15.jpg [20:23] I would move ./Amiga Dream 01 - Nov 1993 - Page 45.jpg to AmigaDream01-Nov1993-Page45.jpg [20:23] Tah dah. [20:25] SketchCow: did you read my suggestion for human ocr at IA? [20:25] That you think the non-human OCR sucks and it should be replaced? [20:25] Was there more to it? [20:26] replaced with a collaborative ocr, the technology for that exists [20:26] I see. [20:27] i think i read about IA saying their books have OCR for blind people [20:27] but... checking the .txt files on scanned books, O_O [20:32] IA has great potential, but most of its content is in useless formats [20:33] I dont want to read a book in JPG/DJVU, I want an epub or a correct txt [20:33] You strike at the heart of an endemic issue with IA. [20:34] I need to do a quick errand. [20:34] But it's a big issue and there may not be an easy solution. [20:34] It's a political and technical issue [20:46] From Wikipedia: Metapedia is a white nationalist and white supremacist,[2] extreme right-wing and multilingual online encyclopedia.[3][4][5] [20:46] And it is a wiki. [20:46] What is your opinion about archiving that? [20:46] Discuss. [20:47] archive it [20:48] archive it [20:54] Is it illegal to upload that into IA? [20:55] I think it depends on the server's location. [20:55] why would it be illegal? [20:56] Speaking in a positive tone about nazism is illegal in some jurisdictions. [21:21] Download it. [21:21] archive it. [21:51] http://arstechnica.com/gaming/news/2011/11/gamepro-magazine-and-website-to-shutter-next-month-1.ars [21:51] emijrp: there is no law against nazism in the United States of America [21:52] December 5th [21:52] "Congress shall make no law ... abridging the freedom of speech, or of the press ..." [22:44] i'm using wget-warc to back up the crankygeeks web site [23:07] for the ffnet archiving effort, where's the list of good id#s? [23:08] Will archiveteam back up GamePro, Waves on Google Wave or Knol? [23:08] And did you guys back up Aardvark? [23:09] is fanfiction.net going down? [23:11] godane: no, preemptive [23:11] ok [23:13] arkhive: I've heard some buzz about Knol. I haven't heard much about Wave. Are you interested in starting a subcommittee? [23:13] can wget-warc sed out the main website? [23:13] id like to help with knol, if theres a script, cause the button option sounds reallyreally slow [23:13] i want it to work locally and be possible to host it locally [23:15] godane: the purpose of WARC is to preserve as close to the source material as possible. as such, altering a page before storing it into the WARC file is to be avoided. wget can --convert-links, but I don't know how this affects the .warc output. [23:15] chronomex: --convert-links is safe to use with warc. [23:16] arkhive: we have the metadata for knols, 700,000, now a scraper for the real content is needed [23:17] chronomex: I'd like to help back Knol up. [23:19] emijrp: can we write a script and have a server tell each connected client what knol to fetch. Like we did for backing up Google Video's Videos?
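Going back to the sed one-liner and the "I would move ./Amiga Dream 01 - Nov 1993 - Page 32.jpg to AmigaDream01-Nov1993-Page32.jpg" dry run at 20:23: a minimal bash sketch of one way to produce renames like that. The rule (collapse " - " to "-", strip every other non-alphanumeric, keep the extension) and the ./*.jpg glob are guesses for illustration, not the script that was actually used:

    #!/bin/bash
    # Dry-run renamer: squeeze " - " down to "-", drop other non-alphanumerics,
    # keep the file extension. Swap the echo for: mv -n "$f" "$new"
    for f in ./*.jpg; do
      base=${f##*/}      # strip leading ./
      ext=${base##*.}    # e.g. jpg
      stem=${base%.*}    # filename without extension
      new="$(printf '%s' "$stem" | sed 's/ - /-/g; s/[^a-zA-Z0-9-]//g').$ext"
      echo "I would move $f to $new"
    done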
[23:20] I'm not too good on writing scripts though. [23:22] im not sure, i dont know how people make those cool distributed scripts [23:22] alard: excellent. [23:23] wget-warc is not converting links [23:23] since everyone seems to be in here anyway, where's the list of good id#s for ffnet [23:23] i still get http://www.crankygeeks.com/favicon.ico [23:23] like links [23:23] -k [23:23] i did [23:23] index.html download [23:24] and i still get those links [23:24] wget-warc -mcpk [23:25] arkhive: channel is #klol [23:27] i'm still getting http://www.crankygeeks.com/favicon.ico links [23:27] i also don't want it redownloading everything [23:28] wget "http://www.crankygeeks.com/" --warc-file="crankygeeks" --no-warc-compression -mcpk [23:28] thats what i used [23:28] my wget-warc is wget [23:29] GamePro's down December 5th... We should also start that. I've got storage and computers I can dedicate to backing it up. [23:30] how big is gampero [23:30] gamepro [23:30] and what is it [23:31] A gaming blog, news, magazine website [23:31] Not sure how big Gamepro's site is. Not sure how to check either. [23:31] delicious was archived? [23:32] godane: Be aware that compressing warcs is preferably done while wget is downloading, not with a post-processing gzip step. So if you do intend to gzip later, it's better to remove the --no-warc-compression [23:33] alard: i want to make sure i don't get the same problem i have now [23:33] if i compress i will not know if it worked correctly [23:34] You can gunzip? [23:34] not until i'm don't download [23:34] *done [23:34] hmm [23:34] guess I'll snag a copy of http://wikileaks.org/the-spyfiles.html [23:35] i'm not downloading a 200mb website to find out it didn't convert the links [23:37] yipdw: thanks, now this channel is being monitored by CIA, NSA and ETs. [23:37] ET? really, cool :) [23:37] emijrp: bomb, bin Laden, airplane [23:37] godane: You should do what you think is best, of course, but: 1. you can gunzip the warc.gz while downloading (it'll just print an error at the end, but you will see what's been downloaded so far) 2. bear in mind that the wget link conversion is always done at the end of the download, if I remember correctly. [23:37] oh [23:37] correct [23:37] anyone know of tools for scraping MediaWiki sites? [23:38] godane: just do gunzip -c my.warc.gz | less [23:38] balrog: I call them WikiTeam tools. [23:38] balrog: wikiteam produced some good tools. look in the archiveteam wiki [23:39] alard: i may never finish downloading cause it keeps downloading blank previous pages [23:39] awesome. thanks! [23:40] it just goes back to the last 15 episodes at over 1500 even when there are only 237 episodes [23:40] balrog: which wiki? [23:40] none that Archive Team would be interested in. it's not going away but I need a backup for various purposes [23:40] ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated [23:40] any way to get around that? [23:42] only affects pages with 1000+ revisions [23:42] not the common case [23:42] ahh, yeah shouldn't have any of those here at all. [23:42] balrog: I think you'll have to upgrade the wiki to fix that, but it rarely causes problems. [23:42] :) [23:42] it's not a wiki I have access to. [23:43] but yeah shouldn't have +1000-rev pages.
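On the MediaWiki-scraping question: the WikiTeam tool being referred to is dumpgenerator.py, which drives Special:Export and the API for you. Roughly how it gets invoked; the flag names are as I recall them from the WikiTeam docs and the wiki URL is a placeholder, so check the output of python dumpgenerator.py --help before relying on this:

    # Grab a full-history XML dump plus images from a MediaWiki install.
    # --api points at the wiki's api.php; there is also an --index option
    # for wikis without a usable API.
    python dumpgenerator.py --api=http://example-wiki.org/api.php --xml --images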
[23:43] underscor: do u have the list of valid ids for ffnet [23:43] i really HATE mirroring websites now [23:44] they just always keep re-mirroring the full site when i just want the updates, crap [23:44] like xkcd.com [23:44] httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo -n --disable-security-limits -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 --update xkcd.com [23:45] thats what i did [23:45] i have an xkcd images script if anyone wants that [23:45] i thought the --update would not fucking redownload files [23:46] why back up xkcd? [23:46] the whole thing? [23:46] again [23:46] i want to host things locally on my local lan [23:46] wget-warc -mcpk --random-wait xkcd.com [23:47] but that will not download imgs.xkcd.com i think [23:47] http://arxiv.org/ is a great site but they have counter-archivist measures [23:52] looks like stupid imgs.xkcd.com can't be mirrored with wget-warc [23:53] if u want imgs.xkcd.com, google this: yaxkcdds.sh
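On the imgs.xkcd.com problem at the end: a recursive wget stays on the starting host unless told otherwise, which would explain why the image host never gets fetched. A sketch that extends the wget-warc -mcpk command from the log with wget's standard host-spanning options; whether a given wget-warc build accepts them alongside --warc-file is an assumption, and this is untested:

    # Mirror xkcd.com and also follow page requisites onto imgs.xkcd.com,
    # writing everything into a compressed WARC (xkcd.warc.gz).
    # --span-hosts allows leaving xkcd.com; --domains keeps the crawl
    # from wandering onto unrelated sites.
    wget --warc-file=xkcd -mcpk --random-wait \
         --span-hosts --domains=xkcd.com,imgs.xkcd.com \
         "http://xkcd.com/"

The -p already inside -mcpk is what requests each comic's image; --span-hosts is only needed because those images live on a different host than the pages.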