#archiveteam 2011-12-02,Fri


Time Nickname Message
00:08 🔗 Paradoks Synchronet is still being developed. I re-learned this in the past few days while reminiscing about my old BBS software ( http://tech.groups.yahoo.com/group/KBBS/ )
00:08 🔗 Paradoks Err, clients. Whoops.
00:09 🔗 Paradoks I figured people just used telnet.
00:14 🔗 Coderjoe if net connected, usually just telnet or the like
00:14 🔗 Coderjoe (unless you want to connect to a system with RIP or some custom graphical client program)
00:15 🔗 Coderjoe (mmmm RIPTERM... something I only used twice or so)
00:20 🔗 bsmith093 still need phone #s
00:20 🔗 bsmith093 \??
00:21 🔗 Coderjoe no idea where to get current numbers
00:24 🔗 bsmith093 are there current numbers?
00:24 🔗 bsmith093 how would it work with telnet
00:24 🔗 Coderjoe you need to know a hostname or ip address to telnet to
00:25 🔗 Coderjoe miku.acm.uiuc.edu
00:27 🔗 DFJustin you can also use dosbox to redirect telnet to an emulated COM port and use old DOS terminal software
00:28 🔗 Coderjoe hahahah
00:29 🔗 Coderjoe "taking a moon lander out to do riceboy drifting out in the parking lot"
00:30 🔗 DFJustin oh it's even easier than I thought in dosbox, you just dial an IP address instead of a phone number
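What DFJustin describes is DOSBox's modem emulation on a serial port. A minimal sketch of the setup, assuming a dosbox.conf with the emulated modem on COM1; the listen port and hostname are placeholders:

    [serial]
    serial1=modem listenport:5000

Old DOS terminal software inside DOSBox then "dials" a host instead of a phone number, e.g. ATDTbbs.example.org:23.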
00:32 🔗 SketchCow Thank you, rude___
00:37 🔗 bsmith093 Coderjoe: haha very funny
00:39 🔗 bsmith093 nyancat or whatever its called
00:41 🔗 bsmith093 although nice trick, i didnt even know the terminal could receive color
00:41 🔗 Coderjoe xterm-color, ANSI, etc
00:42 🔗 bsmith093 SketchCow: anything besides the insanely huge mobileme to archive, something smaller? perhaps fanfiction.net? would love to help with that :) tried wget -mcpk, didnt get all of it though, weird
00:44 🔗 bsmith093 even with the useragent workaround, stopped at 300K files, and i know there are at least 2 million stories
00:44 🔗 db48x Coderjoe: did you see that episode of Top Gear where they actually drove the real moon buggy?
00:44 🔗 Coderjoe no.
00:45 🔗 Coderjoe i want to now
00:45 🔗 db48x I recommend it :)
00:45 🔗 db48x it's the one they developed for a future moon mission, that may or may not ever be used
00:45 🔗 db48x it's got a pressurized cabin
00:45 🔗 db48x 6-wheel drive
00:45 🔗 bsmith093 theres another moon mission?!
00:45 🔗 db48x full independent suspension
00:46 🔗 bsmith093 i want one !
00:46 🔗 bsmith093 coolest...moon buggy...ever!!
00:46 🔗 db48x each wheel is independently steerable
00:46 🔗 db48x the console inside gives you diagnostics on each wheel, showing you how much power is being applied and so on
00:47 🔗 bsmith093 wait... why independently?
00:47 🔗 bsmith093 wouldnt you usually be pointing them all in roughly the same direction at any given time?
00:47 🔗 db48x the wheels might not all be touching the ground at the same time
00:47 🔗 db48x they have a lot of vertical travel so that you can go over rocks
00:47 🔗 db48x yea, in general
00:47 🔗 bsmith093 oh.. yeah... duh moon grav.. wow i feel stupid
00:47 🔗 db48x but you might want to spin in place
00:48 🔗 bsmith093 like a donut, or actually spin in place
00:48 🔗 Coderjoe someone hasn't seen things like zero-point-turn commercial mowers and stuff
00:48 🔗 Coderjoe (though those go by a different means, like tanks)
00:48 🔗 bsmith093 ZERO POINT TURN?!?? for a LAWN MOWER!?! why?
00:48 🔗 db48x yea
00:49 🔗 db48x heh
00:49 🔗 Coderjoe commercial mowers. so they can mow a field faster and get more jobs in during a day
00:49 🔗 bsmith093 thats like, i have a sleep disorder, oh heres a TIME MACHINE!
00:49 🔗 db48x lol, great reference
00:49 🔗 bsmith093 oh well commercial mowers, well that actually makes sense
00:49 🔗 db48x hmm, think chapter 78 is up yet?
00:49 🔗 Coderjoe more jobs means they can get more income
00:50 🔗 bsmith093 yeah, hpmor rocks
00:51 🔗 db48x http://www.topgear.com/uk/photos/topgear-moon-drive?imageNo=1
00:51 🔗 bsmith093 heres all of it in one convenient file
00:52 🔗 db48x I have it on my phone as an ebook too
00:52 🔗 bsmith093 neat
00:52 🔗 Coderjoe eh, what is this?
00:52 🔗 bsmith093 vague much
00:52 🔗 db48x Coderjoe: Harry Potter and the Methods of Rationality?
00:53 🔗 db48x oh, my bad. it's 12-wheel drive
00:53 🔗 db48x 6 pairs of wheels
00:53 🔗 Coderjoe no, i meant the "hpmor"
00:54 🔗 db48x yea, HPMoR, Harry Potter and the Methods of Rationality
00:54 🔗 bsmith093 ACRONYMS ARE YOUR FRIEND
00:54 🔗 bsmith093 stupid caps
00:54 🔗 db48x Coderjoe: http://www.fanfiction.net/s/5782108/1/Harry_Potter_and_the_Methods_of_Rationality
00:57 🔗 db48x Coderjoe: I recommend it even if you don't generally like fan fiction, but be forewarned: your laughter will annoy your housemates/coworkers
01:00 🔗 bsmith093 every story has its own unique id number, they are apparently sequential, hey come to think of it i have a fanfiction downloader that takes link lists
01:01 🔗 bsmith093 ill just generate all possible story ids and plug them into that
01:02 🔗 db48x :)
01:03 🔗 db48x integrate it with the scripts we used for splinder/mobileme/anyhub
01:03 🔗 db48x then we can all help out
01:07 🔗 bsmith093 not really sure how, what im doing (or trying to do) will just grab all the stories, (hopefully) check link by link and download into a text file by category, author, and storyname, using this little binary blob linux app i found here, fanfictiondownloader.net
01:12 🔗 bsmith093 ok well my generator is choking, trying to pump out 10 million links at once, so basically whats the command to generate these http://www.fanfiction.net/s/[0000000-9999999]/1/ note the regex im trying to use
01:12 🔗 bsmith093 0 to 9 999 999
01:13 🔗 bsmith093 probably nowhere near that many stories but the id's are all over the place
01:15 🔗 Coderjoe wow. I never knew there was a printer accessory for the game boy
01:15 🔗 db48x heh
01:15 🔗 Coderjoe (and I had a 1st gen game boy)
01:18 🔗 db48x there was a printer available for my favorite calculator
01:20 🔗 arrith bsmith093: this is one way, but it takes forever: for i in {0000000..9999999}; do echo $i >> file.txt; done
01:21 🔗 arrith actually
01:22 🔗 arrith bsmith093: echo {0000000..9999999} | tr ' ' '\n' > file.txt
01:22 🔗 arrith should be a lot faster
01:24 🔗 bsmith093 ok but with the links around the numbers
01:24 🔗 arrith yeah. took a little less than a minute just now. generated a 70 megabyte file
01:24 🔗 arrith oh
01:24 🔗 bsmith093 sorry this thing's very picky that way
01:24 🔗 arrith yeah sure
01:25 🔗 bsmith093 i can run it from here though :) once i have the command, or if i can ever figure out regex like this for myself :)
01:26 🔗 arrith bsmith093: i dunno if generating the numbers beforehand is faster but, this is what you'd run after that last one (echo | tr thing): while read $num; do echo http://www.fanfiction.net/s/$num/1/ > linklist.txt; done < file.txt
01:26 🔗 bsmith093 i swear every script i've ever seen thats more complicated than wget here and grab this, put it there, looks like chinese to me
01:26 🔗 arrith oh. well. there's that if you want/need it for inspiration
01:26 🔗 arrith yeah bash on a single line isn't too friendly
01:26 🔗 bsmith093 thanks
01:26 🔗 Coderjoe keep in mind that /1/ needs to be incremented as well until you run out of chapters
01:27 🔗 arrith oh, if so then the linklist would just have chapter1 for everything
01:27 🔗 arrith how many chapters should we look for for each?
01:28 🔗 arrith er
01:28 🔗 arrith bsmith093: the command should actually have "> linklist.txt" at the end: while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt
01:29 🔗 Coderjoe number of chapters depends on the story
01:29 🔗 bsmith093 on my end its still generating the file of numbers
01:30 🔗 arrith bsmith093: the "echo {0000000..9999999} | tr ' ' '\n' > file.txt" ?
01:30 🔗 bsmith093 yes
01:30 🔗 arrith ah, ok
01:30 🔗 Coderjoe holy crap. VASTLY different youtube front page, too
01:30 🔗 arrith file should be around 77MB so you can track its progress looking at that
01:30 🔗 arrith bsmith093
01:31 🔗 arrith Coderjoe: yeah but similar to how some ids will not go to real stories, one must kind of pick a default for how many chapters to look for. stopping trying to download more than 4 if say chapter 4 isn't found would be good, but that requires tool support
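A sketch in bash of the chapter walk being discussed: fetch /s/<id>/<n>/ pages until one is missing. The error marker on a missing chapter is an assumption; at this point in the conversation nobody had checked what the site actually returns.

    id=5192986   # example story id
    n=1
    while :; do
        page=$(curl -s "http://www.fanfiction.net/s/$id/$n/")
        # stop at the first missing chapter; the marker text is a guess
        echo "$page" | grep -qi 'not found' && break
        printf '%s' "$page" > "story_${id}_ch${n}.html"
        n=$((n + 1))
    done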
01:32 🔗 rude___ no prob SketchCow.. got more coming your way soon
01:32 🔗 godane did comcast start speeding up download speeds?
01:32 🔗 godane i only ask cause i have 800kbytes down
01:33 🔗 arrith googling found this http://bashscripts.org/forum/viewtopic.php?f=8&t=1081
01:33 🔗 arrith godane: in some areas they increase the dl speed. people usually get an email
01:33 🔗 bsmith093 the thing i have at fanfictiondownloader will auto download if it has more than one chapter, i just have to find out if it will continue upon finding an invalid is
01:34 🔗 arrith oh
01:34 🔗 bsmith093 is sorry this is really slowing down my laptop
01:34 🔗 bsmith093 *id* oy vey typos
01:34 🔗 arrith bsmith093: yeah. maxed out my computer for a little bit. you can renice it and it'll go slower but not take over so much
01:37 🔗 Coderjoe new frontpage: http://i.imgur.com/RPw6K.png
01:38 🔗 PatC Coderjoe, yep :/
01:38 🔗 arrith wow
01:39 🔗 arrith that's quite a change
01:41 🔗 bsmith093 how do i track a file's changes in realtime from the cli
01:42 🔗 bsmith093 when another process is editing it
01:45 🔗 arrith bsmith093: "tail -f file.txt" will output lines getting added to a file. but what i'd do if i were you is "watch ls -lh" in the directory you're generating the txt
01:45 🔗 arrith watch reruns a command, by default every 2 seconds, so you can see how big it's getting
01:46 🔗 arrith tail -f might slow it down is why i say ls over tail
01:48 🔗 bsmith093 its at 270MB and rising
01:48 🔗 bsmith093 linklist
01:48 🔗 arrith er
01:48 🔗 arrith oh
01:48 🔗 arrith i thought for a sec you meant file.txt, heh that'd be way too big
01:59 🔗 arrith btw seems the overall count is 10^7 - 1
01:59 🔗 arrith for ease of notation
02:00 🔗 bsmith093 that finally completed linklist is full of these http://www.fanfiction.net/s//1/
02:00 🔗 bsmith093 in between the double slashes is where the id goes
02:01 🔗 bsmith093 sorry minor glitch there, and i cant see why
02:01 🔗 bsmith093 stopped growing and completed at 306mb
02:01 🔗 arrith hmmm
02:01 🔗 arrith bsmith093: are you on ubuntu or osx?
02:01 🔗 bsmith093 ubuntu
02:02 🔗 bsmith093 lucid 10.04 32bit
02:03 🔗 bsmith093 where was i supposed to run the while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt cause i just ran it in the terminal, like the generator command earlier
02:03 🔗 Coderjoe "I can't believe my dress ripped. They saw everything! Even my Ranma panties. They change color when I get wet."
02:04 🔗 arrith bsmith093: ah yeah i'm getting the same result. sorry about that
02:04 🔗 bsmith093 wait how does it know where $num is?
02:05 🔗 arrith bsmith093: the "while read $num" is supposed to operate on each line of the file piped in, which is "< file.txt"
02:05 🔗 bsmith093 oh, see this is why im going to be taking a sed and bash scripting class in college
02:07 🔗 arrith errr
02:07 🔗 arrith bsmith093: "while read num" not "while read $num"
02:07 🔗 bsmith093 um ok then re running now
02:07 🔗 arrith while read num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt
02:07 🔗 arrith all the same except for that first part
02:08 🔗 arrith sorry about that
02:09 🔗 bsmith093 running perfectly now just gotta wait again
02:09 🔗 bsmith093 meantime lets see what my actual downloader will do to an invalid id
02:12 🔗 arrith bsmith093: good idea
02:13 🔗 arrith bsmith093: btw by my calculations the resulting file should be 80000000 + (10^7-1) * 31 bytes or 371.932954 megabytes (according to google)
02:14 🔗 arrith http://www.google.com/search?q=(80000000+%2B+(10^7-1)+*+31)+bytes+to+megabytes
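Unpacking arrith's estimate: file.txt is 10^7 lines of 7 digits plus a newline, and wrapping each id as http://www.fanfiction.net/s/NNNNNNN/1/ adds 31 bytes per line (the 28-byte prefix plus the 3-byte /1/ suffix):

    10^7 × (7 + 1) bytes  = 80,000,000 bytes   (file.txt)
    10^7 × 31 bytes       ≈ 310,000,000 bytes  (URL wrapping)
    total                 ≈ 390,000,000 bytes  ≈ 372 MiB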
02:16 🔗 bsmith093 well it works great if the link is valid, otherwise it dies
02:17 🔗 bsmith093 although there are always the scripts its based on.. hold on a min
02:19 🔗 bsmith093 here grab this http://fanficdownloader.googlecode.com/files/fanficdownloader-4.0.6.zip
02:20 🔗 arrith ahh python
02:22 🔗 bsmith093 ok now go in and take a look at downloader.py apparently the only thing it CANT do is read links from a file
02:23 🔗 arrith heh
02:23 🔗 bsmith093 is there a pipe for that?
02:23 🔗 arrith welll if i knew python it shouldn't be too hard to add that functionality
02:23 🔗 arrith oh
02:23 🔗 arrith i'd do a bash "while read"
02:24 🔗 Coderjoe http://boingboing.net/2011/12/01/your-tax-dollars-at-work-misl.html
02:24 🔗 arrith bsmith093: while read link; do python downloader.py $link; done < linklist.txt
02:24 🔗 arrith er
02:24 🔗 arrith but if you want .html you have to specify that manually it says
02:25 🔗 arrith so
02:25 🔗 arrith bsmith093: while read link; do python downloader.py -f html $link; done < linklist.txt
02:25 🔗 Coderjoe bah
02:25 🔗 arrith i'd try with 3-5 links before running it on linklist
02:25 🔗 Coderjoe python is simple
02:26 🔗 arrith Coderjoe: not for someone with a huge mental block against learning things in one sitting. i've been meaning to learn it for like a year now. ;(
02:26 🔗 Coderjoe what programming languages do you know?
02:26 🔗 bsmith093 yeah apparently this script was heavily modified to make the binary blob i found, but he did say that, so,... anyway this one wants full urls, not just nice sequential ids
02:27 🔗 Coderjoe (and a real programmer should be able to figure out other languages of the same type fairly easily)
02:27 🔗 bsmith093 fanficdownloader.net is there any way to see inside a linux blob
02:28 🔗 Coderjoe the source is all in the zip file
02:29 🔗 arrith Coderjoe: bash and ti basic
02:30 🔗 arrith fanficdownloader.net isn't loading for me
02:30 🔗 bsmith093 yes but i cant really read python so if uve got it go here fanficdownloader-4.0.6/fanficdownloader/adapters/adapter_fanfictionnet.py
02:30 🔗 arrith bsmith093: what's the difference between the links in linklist and "full urls"?
02:30 🔗 Coderjoe default format of the fanficdownloader python in a zip file is epub
02:30 🔗 bsmith093 fanfictiondownloader.net
02:31 🔗 bsmith093 not fanfic
02:31 🔗 Coderjoe it can also do html or txt
02:31 🔗 arrith oh, help for it seems to say just epub or html
02:31 🔗 arrith derp, nvm. "text or html"
02:32 🔗 Coderjoe though I would prefer to call out to wget-warc or something else that packs a warc
02:32 🔗 bsmith093 linklist has these http://www.fanfiction.net/s/5192986
02:32 🔗 bsmith093 the script currently wants these http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo
02:32 🔗 bsmith093 even though it splices out the story id anyway?!
02:33 🔗 arrith yeah it shouldn't need those
02:33 🔗 arrith bsmith093: wait so it complains if you don't put in the story id?
02:33 🔗 bsmith093 no it complains if you leave off the title like this http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo
02:34 🔗 bsmith093 the thing after the id it wants that
02:34 🔗 bsmith093 which is arbitrary, and not at all sequential
02:34 🔗 arrith bsmith093: try putting a placeholder thing there. like "foo"
02:34 🔗 arrith if it just strips it
02:35 🔗 bsmith093 nope, chokes
02:35 🔗 bsmith093 it reads the full url and loads it
02:35 🔗 bsmith093 so that wont work
02:36 🔗 bsmith093 fanficdownloader-4.0.6/fanficdownloader/adapters/base_adapter.py", line 166, in _fetchUrl
02:36 🔗 bsmith093 raise(excpt)
02:38 🔗 arrith hmm
02:46 🔗 arrith bsmith093: do you want epub or html?
02:46 🔗 bsmith093 i suppose for future proofing purposes not to mention formatting html would be best
02:47 🔗 arrith ah
02:47 🔗 arrith well
02:47 🔗 arrith i'm not getting the error you're getting for some reason
02:48 🔗 arrith all of these link formats work for me with fanficdownloader-4.0.6: http://www.fanfiction.net/s/5192986/ http://www.fanfiction.net/s/5192986/1/ http://www.fanfiction.net/s/5192986/1/A
02:49 🔗 bsmith093 hmmm...
02:49 🔗 arrith i'm doing basically this: python /home/arrith/bin/fanficdownloader-4.0.6/downloader.py -f html http://www.fanfiction.net/s/5192986/
02:50 🔗 arrith bsmith093: pastebin all the output downloader.py gives you
02:50 🔗 arrith says stuff like this for me "DEBUG:downloader.py(93):reading [] config file(s), if present"
02:54 🔗 arrith Coderjoe: have you seen anyone using MHT stuff? or is it not that good compared to WARC? (as in this: https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/ )
02:56 🔗 bsmith093 arrith: here http://pastebin.com/XhecfW5M
02:57 🔗 arrith bsmith093: hmm that's pretty odd. at first glance it almost looks like fanfiction.net blocked you
02:58 🔗 bsmith093 i knew this would be more complicated than just a simple mirror
02:59 🔗 arrith bsmith093: can you go to the url for the fanfiction in a browser or with wget or curl?
02:59 🔗 bsmith093 just thinking of that hold on
03:00 🔗 bsmith093 yes but it saves the page as index.html, and with all the iamges and other things
03:00 🔗 arrith yeah
03:00 🔗 arrith huh, but it lets you. interesting
03:01 🔗 bsmith093 but we could run the linklist through wget and strip out the 404s right
03:01 🔗 bsmith093 but would they even be 404?, damn this is hard
03:02 🔗 arrith a good site would have them as 404s. but yeah, wget has a good option for that: --spider
03:03 🔗 arrith that's a good idea since wget i think would be much faster than this python script
03:03 🔗 arrith bsmith093: btw when you said this earlier, were you talking about fanfiction.net? "<bsmith093> even with the useragent workaround, stopped at 300Kfiles, and i know there are at least 2million stories"
03:04 🔗 bsmith093 oh yeah, but how do i get that to make a list in a file of the storylinks, sorry for being this helpless, its just wget scares me with its many switches and arcane syntax
03:04 🔗 bsmith093 yes i was
03:04 🔗 bsmith093 hold on ill look in the bash history for the command
03:05 🔗 arrith bsmith093: np, i like helping. just i hope at some future point you'll look back over the commands and try to understand them
03:06 🔗 bsmith093 wget-warc -mcpkKe robots=off -U="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.fanfiction.net
03:06 🔗 bsmith093 well, using them this much is helping :)
03:07 🔗 arrith yeah. basically i know what i know due to lots of little jobs. one day i should sit down and read official documentations beyond the manpage but ehh.. not today
03:07 🔗 bsmith093 me too
03:07 🔗 arrith bsmith093: ah, so you were trying to dl ff.net using wget-warc
03:07 🔗 arrith might be why ff.net wouldn't like you that much :P
03:07 🔗 bsmith093 yeah, use a sledgehammer...
03:08 🔗 chronomex to drive a screw!
03:08 🔗 chronomex sure, after much cursing you'll get it in there
03:08 🔗 chronomex but is that really what you need to do?
03:08 🔗 bsmith093 chronomex: where've you been all this time
03:08 🔗 chronomex when?
03:08 🔗 bsmith093 good to have you in the conversation :)
03:08 🔗 * chronomex waves
03:09 🔗 bsmith093 trying to dl fanfiction.net
03:09 🔗 chronomex so I see
03:09 🔗 bsmith093 any suggestions we're onto wget finally, but its tricky
03:10 🔗 arrith dang, ff.net doesn't give a 404 for nonexistent stories
03:10 🔗 bsmith093 could we parse the page for story not found
03:10 🔗 arrith yeah, gotta do that
03:10 🔗 bsmith093 or is it faster to grab anyway, then parse later
03:10 🔗 arrith bsmith093: but that involves dling the page, which is more bandwidth ff.net has to suffer
03:11 🔗 bsmith093 wget has a random wait option
03:11 🔗 arrith oh. all depends on what you want to do. could do it both ways. i tend to like grabbing then parsing later, but i figure diskspace is cheap
03:11 🔗 bsmith093 cant remember th e switch
03:11 🔗 arrith --random-wait :)
03:11 🔗 bsmith093 hey will --spider tell us how big it is
03:12 🔗 arrith random wait is more about not blocking wget than saving the host bandwidth
03:12 🔗 arrith ah interesting thought
03:12 🔗 bsmith093 cause that would be good to know
03:12 🔗 arrith doesn't seem to. just says "Length: unspecified [text/html]"
03:12 🔗 bsmith093 also on the archiveteam, see where tsp got to, this was apparently his baby
03:12 🔗 arrith i'm just looking at the output of this btw: wget --spider http://www.fanfiction.net/s/9999999/
03:13 🔗 bsmith093 sorry the archiveteam *wiki*
03:13 🔗 bsmith093 and
03:14 🔗 arrith np, i got that. only mention i find says "Tsp is attempting to archive the stories from fanfiction.net and fictionpress." on http://archiveteam.org/index.php?title=Projects
03:14 🔗 bsmith093 plus that might not necessarily grab all the chapters, either.
03:15 🔗 chronomex been a while since I've seen Teaspoon around
03:16 🔗 arrith bsmith093: what won't get all the chapters? wget? i figured we're just using wget (or curl) to check if the story exists, then feeding a list of stories into download.py
03:16 🔗 arrith dang, would be nice if that guy had some writeup somewhere on his progress
03:18 🔗 bsmith093 oh yeah, the python script
03:18 🔗 bsmith093 i know, right?
03:19 🔗 arrith bsmith093: i gotta go for a bit for dinner. i'll bbl. next step i see is to get curl or wget to go over the linklist and sort them into a known good (and maybe known bad) list. i'll help with that if you need it when i get back
03:20 🔗 bsmith093 ill look over the wget man pages to see if we cant just save everything as a unique name, and sort later
03:20 🔗 bsmith093 Coderjoe: you still here
03:20 🔗 bsmith093 ???
03:21 🔗 bsmith093 chronomex:
03:21 🔗 bsmith093 alard: \
03:21 🔗 chronomex what, hi
03:21 🔗 chronomex I'm tending a makerbot.
03:21 🔗 bsmith093 oohh, yay for you!
03:21 🔗 chronomex so I'm here, just not watching irc
03:22 🔗 bsmith093 can curl parse html, for a certain string
03:22 🔗 chronomex no
03:22 🔗 bsmith093 ( i know nothing about it)
03:22 🔗 chronomex curl does not look at what it downloads for you
03:22 🔗 bsmith093 cause i have a massive linklist for ff.net and most of them probably dont exist
03:23 🔗 bsmith093 ffnet returns html with story not found if that id doesnt exist
03:23 🔗 chronomex sounds like you'll have to write a custom-ish spider
03:24 🔗 bsmith093 ugh
03:25 🔗 bsmith093 any ideas on code thats already done part of something like this?
03:26 🔗 bsmith093 alard: u seem to write most of the scripts archiveteam uses, any ideas on saving ff.net
03:38 🔗 underscor <alard> For some reason heritrix doesn't really listen to my parallelQueues = 15 setting. It's just running one queue
03:39 🔗 underscor From what I remember of the presentation at IA, you can't spread the same domain into multiple queues
03:43 🔗 chronomex huh.
03:43 🔗 underscor Also, you gamepro guys
03:43 🔗 underscor Make sure you're getting the articles, there are a lot of interstitials
03:43 🔗 chronomex server-side interstitials?
03:44 🔗 bsmith093 any suggestions for things that might work for ffnet
03:44 🔗 underscor The two I got were meta redirects
03:44 🔗 underscor so, yeah
03:44 🔗 chronomex o ok
03:44 🔗 underscor bsmith093: not off the top of my head
03:44 🔗 underscor but it's definitely something I'm interested in
03:48 🔗 bsmith093 hey do invalid links have identical md5sum
03:48 🔗 bsmith093 doesnt really solve the bandwidth load issue, but it would help with weeding
03:48 🔗 Coderjoe bsmith093: sorry, I was trying to get some work done
03:49 🔗 bsmith093 np, we all have lives ;)
03:49 🔗 bsmith093 whoops wrong emote
03:55 🔗 arrith back
03:55 🔗 bsmith095 hey
03:56 🔗 arrith bsmith095: hey
03:56 🔗 bsmith095 so i was looking and i cant find anything to parse a webpage, which is odd
03:56 🔗 arrith ah
03:56 🔗 arrith checking for a thing existing or not shouldn't be hard. just grab the page then grep it
03:56 🔗 arrith well there's a bunch of python libraries to do it 'properly' but i just use grep and exit codes
03:57 🔗 bsmith095 yeah but again the huge bandwidth issue
03:57 🔗 bsmith095 and i have absolutely no idea how u guys do it, but id love to parallelize this problem, how much space could 20 billion words possibly take up?
03:57 🔗 arrith bsmith095: oh, ehh. well at least with wget i wasn't really able to find a way to get it to report page size
03:58 🔗 bsmith095 do they have same md5sum that ould help
03:59 🔗 arrith yeah that's one way. but i'm pretty sure a grep would work fine. it wouldn't be futureproof but it'd get the job done at first
03:59 🔗 arrith bsmith095: but wait, so did that one wget-warc download a decent amount of ff.net? since you said it got up to 300k or something?
03:59 🔗 bsmith095 yeah but it was beat to hell with html, and css and ads and things, plus i needed the space so i dumped it
04:00 🔗 arrith ah
04:00 🔗 arrith bsmith095: i was just wondering what kind of error you ended up getting since that's when ff.net might've blocked you
04:00 🔗 arrith btw about 20 billion words, assuming 5 characters per word and a following space: 20 billion * ((5 bytes) + (1 byte)) = 111.758709 gigabytes
04:00 🔗 bsmith095 holy christ, thats a lot! even for now, where u can buy terabytes like bread
04:01 🔗 arrith heh
04:01 🔗 arrith well compression goes a heck of a long way though. bzip or gzip should do a small fraction of that
04:01 🔗 bsmith095 btw thats the estimate of how big ff is
04:01 🔗 arrith for text at least
04:01 🔗 arrith ah
04:01 🔗 arrith well i'd say you could get that down to maybe a few GB, probably less
04:01 🔗 arrith with compression
04:03 🔗 bsmith095 can i compress then immediately dump the uncompressed, without completely killing my slightly overworked harddrive?
04:03 🔗 bsmith095 is there an app for that, cuz it would rock
04:05 🔗 arrith bsmith095: that'd just be part of a script
04:06 🔗 arrith like have the download.py grab the file, then compress it right after
04:06 🔗 bsmith095 man, i have *got* to learn scripting :D
04:06 🔗 arrith it'd compress individually which won't save as much space, but you can recompress them all in batch after it's done
04:07 🔗 arrith well i can put together a small thing in bash that'll get this job done. then you can learn python and make it all fancy :)
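A minimal sketch of that compress-as-you-go loop, reusing the downloader.py invocation from earlier in the log; it assumes the script drops its html output into the current directory, which is a guess:

    while read link; do
        # grab one story, then immediately compress whatever html it produced
        python downloader.py -f html "$link" && gzip ./*.html
    done < linklist.txt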
04:07 🔗 bsmith095 thatll work
04:07 🔗 arrith one thing you gotta figure out is why download.py isn't working though
04:07 🔗 bsmith095 hey does amazon ec2 have a trial i could completely kill one of their instances so my poor laptop doesnt have to
04:09 🔗 Coderjoe good god ff.net are pricks.
04:09 🔗 Coderjoe User-agent: ia_archiver
04:09 🔗 Coderjoe Disallow: /
04:09 🔗 bsmith095 i figure i could dump the links into wget, have it name the files based on the id# then grep for story not found
04:09 🔗 bsmith095 use wget
04:10 🔗 arrith one sec
04:10 🔗 chronomex Coderjoe: I'm not surprised, given the fanfic people I've known.
04:10 🔗 bsmith095 would amazon ec2 let me use them for this?
04:11 🔗 chronomex there's so much shady shit on ec2
04:11 🔗 arrith bsmith095: might be. but don't run it on anything you need for business incase it does get shut down
04:11 🔗 chronomex just pay your bill and don't run a botnet.
04:11 🔗 bsmith095 meaning what?
04:11 🔗 Coderjoe they'll let you do a lot... but you will probably wind up paying a bunch in bandwidth and instance
04:12 🔗 bsmith095 ugh, bandwidth again, its 2011, nearly 2012, i thought we were past this!
04:12 🔗 Coderjoe (my bill for nov is $255.62)
04:12 🔗 Coderjoe not in the server market
04:12 🔗 bsmith095 for what kind of usage
04:12 🔗 Coderjoe general rule is "sender pays"
04:13 🔗 bsmith095 can this thing be parallelized easily
04:14 🔗 arrith bsmith095: yeah
04:14 🔗 arrith when we did the google video stuff we just put together chunks and people claimed the chunks
04:14 🔗 arrith one person does, 1-20,000, another does 20,000-40,000, etc
04:14 🔗 arrith er 20,001
04:15 🔗 bsmith095 well i have that 300mb file i could pass around
04:15 🔗 arrith yeah. well i think there's some kind of script already for delegating stuff that was mentioned earlier
04:15 🔗 arrith i was gonna look into that
04:15 🔗 arrith "<db48x> integrate it with the scripts we used for splinder/mobileme/anyhub"
04:16 🔗 arrith whatever those are
04:16 🔗 Coderjoe $97.99 in instance charges (one free micro, 245 hours of an m2.xlarge, 194 hours of an m1.large), $64.26 in s3 (stashed some grab stuff in there to get it off an instance. the storage was cheaper than the bandwidth out.), $93.37 in data transfer (873.951GB in for free, 15GB out for free, 778.049GB out for $0.120/gb)
04:16 🔗 bsmith095 theres a repo for them
04:17 🔗 Coderjoe i would much rather do parallelization with a full clean script and a tracker that hands out chunks of a few stories
04:17 🔗 bsmith095 whoo! thats cheap but not super cheap
04:17 🔗 Coderjoe and I was using spot instances for those two non-free instances
04:17 🔗 underscor <Coderjoe> User-agent: ia_archiver
04:17 🔗 underscor <Coderjoe> Disallow: /
04:17 🔗 underscor I wish they would just disobey it
04:18 🔗 underscor I mean
04:18 🔗 underscor Archive teh site regardless
04:18 🔗 Coderjoe when is ff.net going down?
04:18 🔗 underscor but if the robots.txt blocks it, just don't make it public
04:18 🔗 bsmith095 the story links are fanfiction.net/s/0000000 through 9999999
04:18 🔗 arrith Coderjoe: i don't think it is. i think this is just pre-emptive
04:18 🔗 underscor Coderjoe: Pre-emptive afaik
04:18 🔗 bsmith095 Coderjoe: its not that i know fo, im being proactive
04:19 🔗 bsmith095 this is worse than geocities, mostly b/c the "creative, irreplaceable stuff" wuotient is much higher
04:19 🔗 bsmith095 quotient, can u tell im typing by the light of my monitor
04:19 🔗 underscor bsmith095: Why not iterate through every combination?
04:20 🔗 arrith btw was it determined if the geocities effort got all of geocities or were some sites lost?
04:20 🔗 Coderjoe underscor: each story has chapters which are on separate pages
04:20 🔗 bsmith095 we could and i was going to, but thats 10 million links, most of which are non existent story wise, but which give back a page saying story not found
04:20 🔗 Coderjoe arrith: i think sites were lost
04:20 🔗 arrith underscor: we've done that. we have a tool that checks each fanfiction id for chapters
04:20 🔗 bsmith095 and also that chapter thing
04:21 🔗 bsmith095 arrith: we do
04:21 🔗 bsmith095 ??
04:21 🔗 arrith bsmith095: oh yeah, sorry, i thought you knew kinda. the download.py takes just a normal link and grabs all available chapters
04:21 🔗 underscor bsmith095: No need
04:21 🔗 arrith one sec i'll pastebin
04:21 🔗 underscor Just send a HEAD request
04:21 🔗 underscor Only a few bytes
04:21 🔗 bsmith095 a what now?
04:21 🔗 underscor curl -I http://www.fanfiction.net/s/7597723
04:22 🔗 underscor Just gets the headers
04:22 🔗 underscor Tells you whether a story exists or not
04:22 🔗 underscor Then you can go back later on
04:22 🔗 arrith bsmith095: http://pastebin.com/kKpNxEBy
04:22 🔗 bsmith095 underscor: I HEART U
04:22 🔗 arrith underscor: oh yeah that's what we're trying to do now. i was gonna wget then grep for "story not found", but hmm
04:22 🔗 underscor but then you have to download the whole page
04:22 🔗 bsmith095 thats exactly what i was looking for!!!
04:22 🔗 underscor -I is a lot better
04:22 🔗 underscor Now, the interesting thing
04:22 🔗 arrith underscor: is -I to chek for a 404?
04:23 🔗 underscor is that it always returns 200 Ok
04:23 🔗 underscor Nope
04:23 🔗 underscor It just sends you the headers
04:23 🔗 underscor But a valid story will have a header like
04:23 🔗 underscor Cache-Control: public,max-age=1800
04:23 🔗 underscor Last-Modified: Fri, 02 Dec 2011 04:21:35 GMT
04:23 🔗 underscor Invalid ones will have
04:23 🔗 underscor Cache-Control: no-store
04:23 🔗 underscor Expires: -1
04:23 🔗 bsmith095 hey thanks want the linklist
04:23 🔗 arrith ohh clever
04:23 🔗 underscor bsmith095: Isn't it just 0-999999
04:24 🔗 underscor ?
04:24 🔗 bsmith095 see i knew the web would come up with something!
04:24 🔗 arrith 0000000 actually
04:24 🔗 bsmith095 yes
04:24 🔗 arrith i think
04:24 🔗 bsmith095 7 digits
04:24 🔗 arrith dunno if 0 works too
04:24 🔗 bsmith095 probably
04:24 🔗 arrith yeah
04:24 🔗 underscor ok
04:24 🔗 arrith 10^7-1
04:24 🔗 bsmith095 no 7 dig
04:24 🔗 bsmith095 10 million
04:24 🔗 underscor seq -w 0 9999999
04:24 🔗 underscor bam
04:24 🔗 arrith we used this to gen a numberlist: echo {0000000..9999999} | tr ' ' '\n' > file.txt
04:24 🔗 underscor oh, that works too
04:25 🔗 arrith ah yeah, i wasn't sure about seq. just the #bash people always say to use {x..n} over seq, forget why
04:25 🔗 underscor because it's a builtin probably
04:25 🔗 underscor I prefer seq though, personally
04:25 🔗 arrith ah seq does newlines, nice
04:26 🔗 arrith just for fun i'm "time"ing them
04:26 🔗 underscor Yeah, that's one of the reasons
04:26 🔗 bsmith095 190s probably
04:27 🔗 arrith 0m8.544s for seq, my echo one is still running
04:27 🔗 arrith just finished 0m39.947s
04:27 🔗 bsmith095 ure fast
04:27 🔗 arrith seq is so the way to go heh
04:28 🔗 bsmith095 now we just have to get the damn downloader script to take id# as opposed to id# and title links
04:28 🔗 arrith bsmith095: welll i think that's an ip issue, not the script necessarily
04:28 🔗 * underscor quietly works on his own version in bash
04:28 🔗 arrith since it works fine for me
04:28 🔗 arrith underscor: haha
04:28 🔗 arrith lemme pastebin the snippets i have so far
04:28 🔗 underscor I actually started this back in like March
04:29 🔗 underscor haha
04:29 🔗 arrith ohh
04:29 🔗 arrith good to hear
04:30 🔗 bsmith095 pass me a valid link
04:30 🔗 arrith actually the stuff i have is just weird stuff using grep and a thing to generate a ~350MB file of linklists
04:30 🔗 bsmith095 just to test
04:30 🔗 arrith http://www.fanfiction.net/s/7597723
04:30 🔗 bsmith095 anyone have a valid story link
04:31 🔗 arrith or http://www.fanfiction.net/s/5192986/
04:32 🔗 arrith underscor: is your stuff in a single bash script? and are you using fanficdownloader-4.0.6 ?
04:32 🔗 bsmith095 output http://pastebin.com/5Q09g7xB
04:32 🔗 underscor My stuff is a single bash script
04:32 🔗 underscor and no, I didn't know it existed
04:32 🔗 chronomex clearly not enterprisey enough
04:33 🔗 bsmith095 pastebin it please?
04:33 🔗 Coderjoe underscor: for distributed efforts, I would prefer something like python over bash. bash relies on other processes on the system and as a result has too many variations
04:33 🔗 chronomex yes, bash scripts are pretty fragile
04:33 🔗 underscor Absolutely, I agree
04:34 🔗 bsmith095 really? yours have been pretty robust
04:34 🔗 underscor I'm not very comfortable with python though, so I'm just dicking around in bash atm
04:34 🔗 chronomex it takes work to make them robust.
04:34 🔗 Coderjoe (farming out to a wget process is ok)
04:34 🔗 bsmith095 ah
04:34 🔗 * chronomex currently scraping several million pages with ruby
04:34 🔗 arrith learning python is something. but eh, you can keep bash pretty portable
04:35 🔗 bsmith095 chronomex: ruby?
04:35 🔗 chronomex bsmith095: yes, I've been using ruby lately
04:35 🔗 underscor Ruby has a lovely http library
04:35 🔗 bsmith095 weeding the linklist
04:35 🔗 arrith underscor: getting the ff.net effort as part of the scripts for mobile me and stuff to distribute the effort i think would be good
04:35 🔗 underscor typhoeus or something
04:35 🔗 underscor arrith: I agree
04:36 🔗 underscor However, I am probably not your man
04:36 🔗 bsmith095 typhoes?? what
04:36 🔗 underscor mostly due to time
04:36 🔗 arrith underscor: is your bash stuff at a state you can show people?
04:36 🔗 chronomex underscor: I use Mechanize and Hpricot.
04:36 🔗 underscor arrith: Not atm
04:36 🔗 underscor I'll work on fixing it up
04:36 🔗 underscor chronomex: https://github.com/dbalatero/typhoeus
04:37 🔗 Coderjoe arrith: not really. dld-streamer.sh (and my chunky.sh from friendster) relied on associative arrays. CentOS has too old of a bash. freebsd and osx have BSD userland, while unix people typically have gnu userland. and there have been bugs between different versions of tools like expr.
04:37 🔗 chronomex underscor: hmmm, neat. I'm scraping single sites, though, so I don't have much use for 1000 threads :P
04:37 🔗 arrith underscor: hmm alright. i'm not sure what you've done so i dunno if the current methods bsmith095 and i are using are the best ones
04:37 🔗 underscor I tend to do things the "fuck the building is burning down, just get some shit written" way
04:37 🔗 underscor so anything y'all do is probably cleaner
04:37 🔗 chronomex ^
04:38 🔗 underscor My bash scripts are basically the exact opposite of alard's
04:38 🔗 chronomex I do things the underscor way, unless I'm releasing it into the wild.
04:38 🔗 arrith i'm doing "i barely know how to string this stuff together but it seems to work so w/e"
04:38 🔗 underscor Extremely non portable, and no idiot checks
04:38 🔗 bsmith095 yeah, alard would know, somebody wake him/(her?) up
04:38 🔗 underscor him
04:38 🔗 bsmith095 ah
04:38 🔗 underscor He's in NL (I think?) so it's early there (?)
04:39 🔗 bsmith095 where are all the female geeks
04:39 🔗 arrith Coderjoe: ah expr bugs don't sound fun, and yeah you have to avoid a lot to make a portable script. just egh, bash just seems easier to me than python
04:39 🔗 arrith bsmith095: asleep
04:39 🔗 bsmith095 NL, wheres that
04:39 🔗 chronomex arrith: bash is easier to get into but harder to make work well.
04:39 🔗 Coderjoe python is not that hard, and you have a HUGE standard library to rely on
04:39 🔗 bsmith095 terrible with geography
04:39 🔗 chronomex bsmith095: the netherlands? that's in europe, silly
04:39 🔗 bsmith095 ah yeah the web is global, i forgot
04:40 🔗 chronomex ....
04:40 🔗 bsmith095 i once got into an argument with my dad that u cant go east to get to russia, im like no wait thats the other side of ... oh wow duh :D
04:41 🔗 bsmith095 been looking at maps too long need a glboe
04:41 🔗 bsmith095 globe
04:41 🔗 arrith yeah dang, i gotta learn python. and go through all the grueling hours of relearning how to do some simple thing
04:41 🔗 Coderjoe er, the problem was in grep
04:41 🔗 Coderjoe https://github.com/ArchiveTeam/friendster-scrape/commit/b1f5b72cd13e20d6b02c20d8fc7b2710fc816a61
04:41 🔗 bsmith095 arrith: sorry for the tedium
04:41 🔗 arrith with bash it's like you can copy and paste around a bunch, python feels like you can't just piece stuff together
04:42 🔗 arrith Coderjoe: oh dang, grep -o. i've run into so many "-o" bugs it's not even funny
04:42 🔗 underscor For example
04:42 🔗 underscor This is what I'm using to test IDs
04:42 🔗 arrith i just avoid it and do weird sed mangling
04:42 🔗 underscor It's dirty as hell
04:42 🔗 underscor var=`curl -s -I http://www.fanfiction.net/s/5983988|grep Last`;if [ -z $var ]; then echo "Not a story";else echo "Story";fi
04:43 🔗 arrith underscor: i was thinking about asking you how you did that. just now i was diffing the output of various curl -Is
04:43 🔗 bsmith095 now print that list to a file and were golden
04:43 🔗 underscor I use grep -oP all the time
04:43 🔗 underscor it's rad
04:43 🔗 bsmith095 wth is the z switch
04:43 🔗 chronomex bsmith095: null-terminated
04:43 🔗 chronomex oh, in [
04:43 🔗 chronomex bsmith095: 'man test'
04:43 🔗 underscor It's "if it is set"
04:44 🔗 arrith yeah if it exists
04:44 🔗 Coderjoe -z is string is empty
04:44 🔗 arrith underscor: i tend to go off of grep's exit code
04:44 🔗 arrith grep thing; if [ $? -eq 0 ]; then stuff; fi
04:45 🔗 arrith exit codes seem 'faster' to me
04:45 🔗 Coderjoe underscor: what happens if the story has the word "Last" in it?
04:45 🔗 underscor Doesn't matter
04:45 🔗 underscor curl -I gets headers only
04:46 🔗 Coderjoe oh. you're checking Last-modified
04:46 🔗 underscor Yeah
04:46 🔗 underscor invalid stories don't have ti
04:46 🔗 underscor s/ti/it/
04:46 🔗 Coderjoe why can't people just use frigging HTML status codes. this is EXACTLY what 404 is for, and you can still have your own custom 404 page
04:47 🔗 chronomex s/HTML/HTTP/
04:47 🔗 Coderjoe yes, I meant http
04:48 🔗 arrith yeah for stories that don't exist they don't have Last-Modified, and they also have "Cache-Control: no-store" and "Expires: -1"
04:48 🔗 chronomex that's fucking retarded.
04:48 🔗 arrith Coderjoe: seriously. ff.net not using 404s is so annoying right now
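Putting the header fingerprint together, a cleaned-up sketch of the HEAD-based id check (same idea as underscor's one-liner, with the grep anchored and the output split into lists; the goodlist/badlist file names are placeholders):

    # valid stories carry a Last-Modified header, invalid ones do not
    while read num; do
        if curl -s -I "http://www.fanfiction.net/s/$num/" | grep -q '^Last-Modified:'; then
            echo "$num" >> goodlist.txt
        else
            echo "$num" >> badlist.txt
        fi
    done < file.txt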
04:49 🔗 underscor I wonder how ff.net feels about 10 million HEAD requests
04:49 🔗 underscor lol
04:50 🔗 arrith they should've used 404s :P
04:50 🔗 arrith better than grabbing the full page like i was gonna do..
04:50 🔗 underscor Well, they'd be HEADs regardless
04:50 🔗 underscor At least we don't have to grab the full page
04:50 🔗 arrith oh, right
04:51 🔗 arrith i guess i assumed whatever wget --spider does is as lightweight as it can get. i actually don't know exactly what it does
04:51 🔗 arrith some kind of HEAD
04:52 🔗 arrith what's the best thing like piratepad to use these days in terms of doesn't time out?
04:52 🔗 chronomex typewith.me ?
04:52 🔗 arrith oh hmm wait actually, there's one for code
04:52 🔗 arrith i forget its name. it's new
04:53 🔗 Coderjoe arrith: splinder wasn't using a status code to say "hey, we're temporarily down for maintenance". instead they redirected to /splinder_noconn.html which was a 200
04:53 🔗 arrith Coderjoe: ahh wow
04:54 🔗 arrith ahh i was thinking of stypi but it doesn't have bash/sh support
04:54 🔗 arrith ;/
04:54 🔗 arrith i wonder what they do support is the closest to bash
04:59 🔗 underscor \o/ Progress
04:59 🔗 underscor http://pastebin.com/ReqNs8TF
05:00 🔗 arrith underscor: looks good
05:02 🔗 bsmith093 i hate wireless when im at the fringes, what'd i miss?
05:03 🔗 arrith bsmith093: one sec
05:06 🔗 arrith bsmith093: http://pastebin.com/1QN2tagB
05:07 🔗 arrith oh
05:07 🔗 arrith bsmith093: also http://badcheese.com/~steve/atlogs
05:08 🔗 arrith forgot this channel had that
05:13 🔗 bsmith093 ok now pass storyinator the valid ids and itll grab them all
05:14 🔗 arrith bsmith093: storyinator?
05:14 🔗 bsmith093 storyinator, the output of the last pastebin link
05:15 🔗 bsmith093 do we have a weeded list yet?
05:16 🔗 arrith bsmith093: is storyinator something? searching the backlog doesn't show anything
05:16 🔗 arrith the download.py thing?
05:16 🔗 bsmith093 check the logs the last pastebin
05:17 🔗 arrith ohh
05:17 🔗 arrith bsmith093: storyinator is i guess the name of what underscor is working on. it's not done yet
05:17 🔗 bsmith093 http://pastebin.com/ReqNs8TF
05:19 🔗 arrith bsmith093: yeah that, it's not done yet
05:19 🔗 arrith but here: http://paste.pocoo.org/show/515656/
05:20 🔗 arrith that's not to be run just as a script, but each piece kinda ran individually
05:21 🔗 arrith bsmith093: should generate a list of good and bad IDs (lines 21-29) then you just feed the list of good IDs into the fanficdownloader (lines 32-34)
05:21 🔗 arrith assumes you already have a nums.txt
05:23 🔗 bsmith093 yay u rock
05:24 🔗 bsmith093 see i knew this wouldnt get done unless i nagged the community to get to it, and save already
05:25 🔗 bsmith093 200tb wont back itself up, but this is nothing compared to mobileme, in terms of volume anyway
05:27 🔗 bsmith093 this will probably take all night
05:32 🔗 bsmith093 good news is i can just dump these good id# links back into the original fanfic downloader im personally using
05:37 🔗 underscor wheeeee
05:37 🔗 underscor Frontpage Gotten
05:37 🔗 underscor Let's get some metadata.
05:37 🔗 underscor Running storyinator on id 5983988
05:37 🔗 underscor Title is A Different Beginning for the Cherry Blossom
05:37 🔗 underscor Written by Soulless Light, whose userid is 1807842
05:37 🔗 underscor Placed in anime>>Naruto
05:38 🔗 bsmith093 can i get that script
05:40 🔗 bsmith093 approx 7 hrs till the id sorter is done sorting
05:41 🔗 underscor bsmith093: Doesn't actually download anything yet
05:41 🔗 bsmith093 try this, fanfictiondownloader.net
05:41 🔗 bsmith093 throw the good story ids in there
05:42 🔗 bsmith093 fanfictiondownloader.com
05:42 🔗 bsmith093 nevermind it is .net
05:43 🔗 bsmith093 www.fanfictiondownloader.net
05:43 🔗 arrith oh one thing
05:43 🔗 bsmith093 what
05:43 🔗 arrith bsmith093, underscor: the download.py thing only gets the stories i think
05:43 🔗 underscor oh okay
05:43 🔗 arrith but on ff.net there's like author commentary, history of stuff getting posted
05:43 🔗 bsmith093 ummm yeah thats the idea
05:43 🔗 underscor I'm getting reviews and a bunch of stuff
05:43 🔗 arrith yeah
05:44 🔗 arrith there's a lot more to the site than the stories
05:44 🔗 arrith so i'd want to include those in a proper archival process
05:44 🔗 arrith underscor: good to hear
05:44 🔗 bsmith093 well ok then you get that ill get the stories
05:44 🔗 arrith bsmith093: heh well a backup of just the stories is still good to have
05:44 🔗 arrith then in a scramble there's a lot less to dl
05:44 🔗 bsmith093 how will we re run this to update the archive
05:44 🔗 bsmith093 just futureproofing here
05:44 🔗 arrith yeah that's something, periodic rerunning
05:45 🔗 bsmith093 merge the deltas
05:45 🔗 arrith i didn't really put stuff into that script for that above hmm
05:45 🔗 bsmith093 see this is why we only bother to archive closing sites, they dont change as much
05:46 🔗 underscor well, just figure out what the current latest ID is
05:46 🔗 bsmith093 umm how exactly
05:47 🔗 underscor 7601310 is the current latest
05:47 🔗 bsmith093 ure head thing done yet?
05:47 🔗 underscor http://m.fanfiction.net/j/
05:47 🔗 bsmith093 as of when
05:47 🔗 underscor 10 seconds ago
05:47 🔗 underscor It's whatever's on the top of that page
05:48 🔗 bsmith093 see this is what im talking about, we'll always be behind this site
05:48 🔗 arrith underscor: and every number up to that is known used?
05:48 🔗 bsmith093 lots of skipped
05:48 🔗 arrith hm
05:48 🔗 underscor No
05:49 🔗 underscor But it's easy to have something check that page every 5 minutes
05:49 🔗 Coderjoe well, if you use wget-warc, with a cdx file, you can have it save the updated page
05:49 🔗 bsmith093 arrith: run ure own script youll see there are a lot of holes in the seq
05:49 🔗 arrith bsmith093: ah
05:50 🔗 arrith you'd need to check that page of new stuff pretty rapidly to make sure you don't miss anything
05:50 🔗 Coderjoe not really
05:50 🔗 underscor Yeah
05:50 🔗 underscor once every 10 minutes is probably sufficient
05:51 🔗 underscor or less often
05:51 🔗 arrith since the only other way i can think of to get new chapters is to go through story IDs to check for any new ones, then check all working story IDs for new chapters
05:51 🔗 bsmith093 again, not to beat a dead horse, but this is a huge, popular, currently active website
05:51 🔗 underscor yeah
05:51 🔗 Coderjoe you just have one worker that checks it periodically, checks between the last-known-max and the latest on that page, and notifies the tracker
05:51 🔗 arrith oh
05:51 🔗 bsmith093 and thats another thing can somebody please code up a tracker?
05:52 🔗 arrith right if it's always sequential. for a second i thought it was random, nvm
05:52 🔗 bsmith093 so is it seq
05:52 🔗 Coderjoe the holes are from deleted stories
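A sketch of the single-worker watcher Coderjoe outlines, assuming the newest story id can be scraped off http://m.fanfiction.net/j/ by grepping for /s/<id> links (the pattern and the to_check.txt hand-off are assumptions):

    last=$(cat last_id.txt 2>/dev/null || echo 0)
    latest=$(curl -s http://m.fanfiction.net/j/ | grep -o '/s/[0-9]*' | grep -o '[0-9]*' | sort -n | tail -1)
    if [ -n "$latest" ] && [ "$latest" -gt "$last" ]; then
        # queue the new id range for the tracker to hand out
        seq "$((last + 1))" "$latest" >> to_check.txt
        echo "$latest" > last_id.txt
    fi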
05:52 🔗 arrith does that updating page include reviews/author comments/user comments/etc though? looks to me like it's just new stories
05:53 🔗 bsmith093 thats a lot of deletions any hope of recovery?
05:53 🔗 bsmith093 ia wayback maybe?
05:53 🔗 underscor ia doesn't archive it
05:53 🔗 underscor arrith: nope
05:53 🔗 Coderjoe they block IA, remember
05:54 🔗 bsmith093 WTH not?!
05:54 🔗 Coderjoe they block IA, remember
05:54 🔗 bsmith093 so use googlebot, well its too late now, but still!
05:54 🔗 arrith all the more reason to have an ongoing mirror
05:54 🔗 Coderjoe http://www.fanfiction.net/robots.txt
05:54 🔗 arrith which afaik means periodic respidering
05:54 🔗 arrith for new comments, etc
05:55 🔗 underscor yep
05:55 🔗 bsmith093 and periodic dumps, like for wikipedia, but actually GOOD
05:55 🔗 arrith yeah i recently saw someone talking about looking for directions on how to setup a wikipedia dump and was having a bit of trouble. i dunno how easy it is but it didn't sound fun to me
05:56 🔗 Coderjoe http://b.fanfiction.net/atom/j/0/2/0/
05:56 🔗 Coderjoe "updated stories" in a nice rss feed
05:56 🔗 bsmith093 see what a kick in the pants can do for productivity, none of this (afaik) was happening 6 hrs ago
05:56 🔗 arrith Coderjoe: ah nice and structured. gj
05:57 🔗 bsmith093 ffnet is "fully automated"
05:57 🔗 arrith bsmith093: ehh underscor was doing some stuff i think technically
05:57 🔗 bsmith093 thats why i said afaik
05:57 🔗 Coderjoe and here's new stories: http://b.fanfiction.net/atom/j/0/0/0/
05:58 🔗 bsmith093 yay, an atom feed we can scrape that!
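A one-liner sketch for pulling story ids out of that new-stories feed; the link format inside the atom entries is an assumption:

    curl -s http://b.fanfiction.net/atom/j/0/0/0/ | grep -o 'fanfiction\.net/s/[0-9]*' | grep -o '[0-9]*$' | sort -un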
05:58 🔗 arrith and wherever that one guy left off. if he left a record
05:58 🔗 bsmith093 he didnt
05:58 🔗 arrith underscor: did you track how far Teaspoon / tsp got on ff.net?
05:59 🔗 underscor Nope, sorry
05:59 🔗 bsmith093 SketchCow: if ura still up, any thought/input/ constructive critisisms
05:59 🔗 Coderjoe reviews are under /r/ instead of /s/
05:59 🔗 Coderjoe http://www.fanfiction.net/r/7573167/
05:59 🔗 bsmith093 thats useful go, automation!
06:00 🔗 bsmith093 does the r match the s for the same story
06:00 🔗 arrith underscor: np. eh well, probably not that far
06:00 🔗 bsmith093 same #
06:00 🔗 underscor yes
06:00 🔗 Coderjoe yes
06:00 🔗 bsmith093 whhoohoo ! so easy then
06:01 🔗 Coderjoe there are also communities and forums that need archiving
06:01 🔗 bsmith093 are those braindead simple url too
06:01 🔗 Coderjoe communities are not
06:02 🔗 Coderjoe nor are forums
06:02 🔗 bsmith093 oy well cant have everything
06:02 🔗 bsmith093 oh wait, I can, GO ARCHIVETEAM!
06:02 🔗 arrith Coderjoe: you wouldn't happen to have found an atom/rss feed for reviews have you?
06:03 🔗 arrith since the more structured the less darkarts html parsing
06:04 🔗 Coderjoe they might be on that update feed
06:04 🔗 Coderjoe (the first one, /atom/j/0/2/0)
06:04 🔗 bsmith093 best part about this script is its much lighter on my cpu and disk io
06:04 🔗 Coderjoe it will list the story, and I think a new review might cause it to go on that
06:05 🔗 Coderjoe I know it does chapters
06:05 🔗 bsmith093 just passed 0006000
06:06 🔗 Coderjoe nope. this story has one review posted a couple weeks ago
06:06 🔗 Coderjoe but is listed on the update feed with an update date of today. I think they posted a new chapter
06:06 🔗 bsmith093 see if people wouldn't write so much we'd have less work :)
06:09 🔗 bsmith093 its 10927est so off to bed for me, not leaving though, will be asleep
06:11 🔗 bsmith093 quick thought here it would be great if once we have everything eatch out for ompleted sotries and pull them from the scrape queue
06:11 🔗 bsmith093 completed stories and pull them out
06:12 🔗 bsmith093 gnight
06:13 🔗 Coderjoe still want to scrape for reviews. and there is nothing that says the author can't revise something
06:14 🔗 arrith yeahh i was thinking author edits
06:14 🔗 arrith and author comments
06:14 🔗 arrith i don't have experience with this kind of stuff but i'm hoping the header last modified is accurate in this case
06:15 🔗 arrith bsmith093: you're checking for existing stories?
06:15 🔗 arrith since if fanfiction.net is blocking him ( bsmith093 ) then he'd just get a big list of false negatives ;/
06:16 🔗 underscor Do new comments change the "update date"?
06:16 🔗 Coderjoe no
06:16 🔗 underscor ok
06:16 🔗 Coderjoe at least I don't think so
06:17 🔗 Coderjoe that's something I haven't specifically checked. I did find that the stories I looked at had a newer update date than the last review
06:20 🔗 arrith underscor: for ease of notation, you can have the 999 stuff like this: seq -w 0 $((10**7 - 1))
06:20 🔗 arrith or $[10**7 - 1]
06:20 🔗 Coderjoe bleh
06:20 🔗 arrith hard for me at least to visually see how many 9s there are
06:20 🔗 Coderjoe seq was another compatibility issue
06:20 🔗 godane i have 87 episodes of crankygeeks
06:21 🔗 arrith Coderjoe: ohh yeah. trying to shoehorn seq stuff into jot on osx
06:21 🔗 godane i'm just getting ipod format since its only 105mb after episode 70
06:22 🔗 godane mostly so i can fit 40 episodes onto a dvd
06:23 🔗 NotGLaDOS At that size, don't you mean approx 47?
06:23 🔗 arrith 4.3 for gibibytes
06:23 🔗 arrith about
06:23 🔗 godane more like 43
06:23 🔗 NotGLaDOS Ah.
06:23 🔗 NotGLaDOS I have a 4.7 GB DVD-R here.
06:23 🔗 godane some are 115mb
06:23 🔗 arrith 4.7 uses the base 10 'gigabyte' that harddrive mfr scammers use
06:24 🔗 NotGLaDOS Also, why is this not changing nick to NotGL-
06:24 🔗 arrith 4.7 gigabyte to gibibyte= 4.3772161006927490234375 gibibytes
06:24 🔗 NotGLaDOS Ah
06:25 🔗 arrith underscor: you gotta have a github repo called "DON'T LOOK HERE" then just secretly push stuff like your ff.net work ;P
06:26 🔗 NotGLaDOS <*status> | STR_IDENT | 1 | Yes | irc.underworld.no | NotGLaDOS!STR_IDENT@ip188-241-117-24.cluj.ro.asciicharismatic.org | 3 |
06:26 🔗 NotGLaDOS I don't get it.
06:27 🔗 NotGLaDOS ...wait, is my nick NotGLaDOS?
06:27 🔗 arrith it is
06:27 🔗 Coderjoe yes
06:27 🔗 NotGLaDOS ...damn quassel playing tricks on me
06:28 🔗 NotGLaDOS And fixed, with hackery.
06:28 🔗 Coderjoe <NotGLaDOS> And fixed, with hackery.
06:28 🔗 Coderjoe try /nick YourNewNick
06:28 🔗 NotGLaDOS I had to force a module to send a false IRC command on the in direction, because it was displaying my nick as STR_IDENT
06:29 🔗 underscor Wheee
06:29 🔗 underscor Even farther!
06:29 🔗 underscor http://pastebin.com/MWsp8Fv3
06:29 🔗 arrith underscor: progress?
06:29 🔗 underscor arrith: Yep :)
06:29 🔗 arrith ahh pretty nice
06:30 🔗 arrith underscor: at this point are you echoing extracted data?
06:30 🔗 arrith or is there stuff it's doing on the bg that isn't echoed?
06:30 🔗 Coderjoe story ID number in xml?
06:31 🔗 Coderjoe (yes, you have the directory, but the number in the xml can help make sure that it can be correlated if separated and/or renamed)
06:33 🔗 underscor arrith: Everything its doing is echoed
06:33 🔗 underscor Coderjoe: Good idea, added
06:33 🔗 underscor it's*
06:36 🔗 bsmith093 is there a repo yet?
06:36 🔗 arrith underscor: what do you use to deal with xml in bash?
06:36 🔗 arrith bsmith093: kind of underscor's pet project at this point i think
06:36 🔗 underscor yeah
06:36 🔗 underscor xml handling is done in php
06:36 🔗 arrith bsmith093: also, you might want to doublecheck that you're getting some good nums and not all bad
06:37 🔗 arrith ohh
06:37 🔗 arrith underscor: cheater! :P
06:37 🔗 underscor function assocArrayToXML($root_element_name,$ar)
06:37 🔗 underscor {
06:37 🔗 underscor $f = create_function('$f,$c,$a','
06:37 🔗 underscor foreach($a as $k=>$v) {
06:37 🔗 underscor if(is_numeric($k))
06:37 🔗 underscor $k="v".$k;
06:37 🔗 underscor if(is_array($v)) {
06:37 🔗 underscor $ch=$c->addChild($k);
06:37 🔗 underscor $f($f,$ch,$v);
06:37 🔗 underscor } else {
06:37 🔗 underscor $c->addChild($k,$v);
06:37 🔗 underscor }
06:37 🔗 underscor }');
06:37 🔗 underscor $xml = new SimpleXMLElement("<?xml version=\"1.0\"?><{$root_element_name}></{$root_element_name}>");
06:37 🔗 underscor $f($f,$xml,$ar);
06:37 🔗 underscor return $xml->asXML();
06:37 🔗 underscor }
06:37 🔗 bsmith093 yeah i think im getting false negs
06:37 🔗 arrith bsmith093: are you getting any positives?
06:38 🔗 bsmith093 random check
06:38 🔗 arrith bsmith093: as in any in goodlist.txt?
06:38 🔗 bsmith093 0005543
06:38 🔗 arrith underscor: ah, not too complicated
06:39 🔗 arrith underscor: could probably rewrite that in bash..
06:39 🔗 bsmith093 in goodlist but not there in firefox
06:39 🔗 SketchCow WHAT THE HELLO HI
06:39 🔗 bsmith093 check please
06:39 🔗 SketchCow OK, my internet is back.
06:39 🔗 bsmith093 hey hey hey its SkeeetchCow
06:39 🔗 SketchCow Jesus, lot of backlog.
06:39 🔗 NotGLaDOS I know.
06:39 🔗 NotGLaDOS I looked a the channel and thought "Fuck it."
06:40 🔗 bsmith093 fanfiction.net/s/0005543
06:40 🔗 SketchCow I can't do that.
06:40 🔗 SketchCow So bsmith093 is trying to save fanfiction?
06:40 🔗 arrith bsmith093: yep dang. that is a false positive
06:40 🔗 bsmith093 well me and underscor
06:40 🔗 bsmith093 so its not just me
06:40 🔗 arrith SketchCow: bsmith093 really wants to, underscor is doing a lot of stuff, i'm poking around with parts of it and Coderjoe is helping
06:40 🔗 SketchCow So my question is, what's going on?
06:40 🔗 SketchCow It's shutting down?
06:40 🔗 underscor No
06:40 🔗 arrith SketchCow: no, all pre-emptive
06:40 🔗 bsmith093 ffnet seq id check
06:41 🔗 underscor Premptive
06:41 🔗 SketchCow OK.
06:41 🔗 SketchCow So remember, if it's pre-emptive, don't rape it.
06:41 🔗 bsmith093 if it was id make sure google knew
06:41 🔗 underscor SketchCow: ofc
06:41 🔗 SketchCow That's all I can really contribute.
06:41 🔗 SketchCow Looks javascript free.
06:41 🔗 underscor I like raping sites though :(
06:41 🔗 arrith SketchCow: they have a pretty light blocking trigger finger i think. at least they blocked bsmith093 somehow for some reason i think
06:41 🔗 SketchCow Have we considered just pinging them?
06:41 🔗 SketchCow Going HEY WE WANT A COPY
06:41 🔗 SketchCow Or no
06:41 🔗 bsmith093 and it has atom feeds for everything! whooA!
06:41 🔗 arrith not sure if anyone thought of trying that
06:41 🔗 underscor They block IA
06:42 🔗 underscor so they're a hostile target
06:42 🔗 underscor <sunglasses>
06:42 🔗 bsmith093 i keep saying use googlebot
06:42 🔗 arrith yeah, i think based on blocking IA people figured to not try
06:42 🔗 underscor bsmith093: No, I mean
06:42 🔗 underscor They block the wayback machine
06:42 🔗 underscor It doesn't actually stop us
06:42 🔗 bsmith093 so switch the useragent
06:42 🔗 underscor ???
06:42 🔗 arrith one guy, Teaspoon / tsp i guess worked on this a little while ago but he hasn't been seen and people haven't seen how far he got
06:43 🔗 underscor bsmith093: What do you mean? Wayback Machine isn't going to switch its useragent...
06:43 🔗 underscor They obey robots.txt for legal reasons
06:43 🔗 arrith bsmith093: they haven't blocked underscor's efforts afaik
06:43 🔗 underscor I'm up to 71k
06:43 🔗 arrith bsmith093: so he doesn't need to change his UA, at least not yet
06:43 🔗 bsmith093 underscor: ohhh, ok then that makes much more sense
06:43 🔗 underscor (just checking IDs
06:43 🔗 underscor )
06:44 🔗 bsmith093 i suppose that's so they can say well we didnt save u cause u blocked us
06:44 🔗 arrith underscor: are you still doing that check for "Last" in the header?
06:44 🔗 underscor Yeah
06:44 🔗 arrith underscor: since i think bsmith093 just found a false positive, he checked
06:44 🔗 arrith here
06:44 🔗 arrith underscor: http://www.fanfiction.net/s/0005543/
06:45 🔗 bsmith093 still dead for me
06:45 🔗 arrith oh wait
06:45 🔗 arrith i might've mixed up dead and not dead
06:45 🔗 underscor Not a story
06:45 🔗 underscor var=`curl -s -I http://www.fanfiction.net/s/0005543/|grep Last`;if [ -z "$var" ]; then echo "Not a story";else echo "Story";fi
06:45 🔗 arrith bsmith093: goodlist is badlist and badlist is goodlist
06:45 🔗 underscor Doesn't trip up mine
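(A rough sketch of looping that header check over an ID range — the range and list file names are made up; like the one-liner above, it treats a Last-Modified header as "story":)

  # classify fanfiction.net story IDs by whether the HEAD response
  # carries a Last-Modified header; seq -w zero-pads to match IDs like 0005543
  seq -w 1 9999999 | while read -r id; do
    var=$(curl -s -I "http://www.fanfiction.net/s/$id/" | grep Last)
    if [ -z "$var" ]; then
      echo "$id" >> badlist.txt    # no header: not a story
    else
      echo "$id" >> goodlist.txt   # header present: looks like a story
    fi
  done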
06:45 🔗 bsmith093 ure kidding me!?
06:45 🔗 underscor lol
06:45 🔗 arrith bsmith093: just rename them when it's done :P
06:45 🔗 bsmith093 checking bad list then
06:45 🔗 arrith i blame the error codes
06:46 🔗 bsmith093 0009863
06:47 🔗 bsmith093 it's good, u did reverse the polarity
06:47 🔗 bsmith093 yay tom baker
06:47 🔗 arrith bsmith093: yeah looks like a story
06:47 🔗 arrith heh, yeahh
06:47 🔗 arrith i did say i didn't check the script :P
06:47 🔗 bsmith093 stop and switch or keep going
06:47 🔗 underscor 77000 now
06:48 🔗 bsmith093 underscor: what are those numbers u keep giving
06:48 🔗 underscor id's I've checked up to
06:48 🔗 SketchCow OK, this needs to go to another channel.
06:48 🔗 underscor for story/not story
06:48 🔗 underscor :(
06:48 🔗 SketchCow #fanboys or #fanfriction
06:48 🔗 bsmith093 HOW R U THAT FAST?
06:48 🔗 underscor second one
06:48 🔗 bsmith093 K THEN
06:48 🔗 bsmith093 caps
06:49 🔗 arrith bsmith093: keep going
06:49 🔗 bsmith093 k
06:49 🔗 arrith SketchCow: you are good with those names
06:49 🔗 bsmith093 remind me in 7hrs when its done
07:05 🔗 arrith reading the logs in the topic brought up a question that i didn't see answered: does archive.org archive porn?
07:06 🔗 arrith or IA rather
07:08 🔗 underscor ^
07:09 🔗 underscor SketchCow
07:09 🔗 arrith SketchCow: does the Internet Archive archive porn?
07:09 🔗 underscor lol
07:23 🔗 DFJustin http://www.archive.org/details/70sNunsploitationClipsNunsBehavingBadlyInBizarreFetishFilms
07:23 🔗 Coderjoe ugh
07:24 🔗 Coderjoe you had to link to the one I had seen before
08:04 🔗 SketchCow Why do people ask that
08:04 🔗 SketchCow Since Porn simply means "Material considered sexually or morally questionable by random community standards", of course it does.
08:05 🔗 SketchCow So does google and so does facebook
08:05 🔗 arrith good
08:06 🔗 arrith SketchCow: what irc bouncer do you use? znc or irssi maybe?
08:13 🔗 SketchCow Irssi
08:14 🔗 SketchCow Pumped through a screen session
08:15 🔗 arrith yeahh. i gotta learn irssi.
08:27 🔗 SketchCow It's not so bad.
08:27 🔗 SketchCow I use a screen session that puts the channel list along the right, like mIRC used to.
08:27 🔗 SketchCow Also, I put this on the machine that runs textfiles.com and a bunch of services.
08:27 🔗 SketchCow So I know INSTANTLY if something's wrong with the machine.
08:30 🔗 arrith well you can't really right click on things and do other gui-ish stuff that you can in xchat. i can totally see me using irssi for logging but for everyday stuff i'm not sure yet
08:35 🔗 dnova irssi supremacy
08:41 🔗 SketchCow You can if you have an ssh client that makes URLs alive.
08:41 🔗 SketchCow And people? Fuck people
08:41 🔗 SketchCow They're all the same
08:42 🔗 SketchCow who needs to right click on them
08:45 🔗 dnova SketchCow: are you shooting all video on dslrs?
08:46 🔗 SketchCow Yes.
08:46 🔗 dnova that became a thing pretty quickly
08:51 🔗 chronomex apparently it doesn't suck much at all.
08:51 🔗 chronomex sensor is sensor, and dslrs often have nice sensor.
08:52 🔗 dnova and excellent glass
08:52 🔗 dnova and more variety
08:52 🔗 dnova I don't think anything about it sucks
09:08 🔗 SketchCow Some things suck.
09:08 🔗 SketchCow But they're quite doable for what they are.
09:09 🔗 SketchCow I have to also point out that I was trained, at 20, to be able to unload, canister, and then reload and thread 16mm film into a set of reels, all while inside a leather bag so they wouldn't be exposed to light.
09:09 🔗 SketchCow Comparatively, this new material is even better than what that was giving me.
09:10 🔗 dnova what are the things that suck?
09:10 🔗 dnova I know very little about video production
09:12 🔗 SketchCow http://www.youtube.com/watch?v=mEdBId3OuuY
09:16 🔗 dnova no gui at the moment
09:42 🔗 ersi arrith: I'm using irssi in a screen session, and I'm saying you're wrong. I'm clickin' links like a darned Mechanical Turk on speed
09:43 🔗 ersi Most terminals convert text that they think is a link, to a clickable element. PuTTY (Win/*nix), gnome-terminal, rxvt, xterm, iTerm and co all do
10:16 🔗 arrith were passwords ever reset on the wiki?
10:16 🔗 arrith i haven't logged in for a while and i'm having trouble. my password autocompletes but the login doesn't work
10:22 🔗 SketchCow Maybe
10:22 🔗 SketchCow I cleared out one-offs who did nothing
10:24 🔗 arrith SketchCow: could you look into User:Arrith really quick?
10:24 🔗 dnova just reset your password
10:24 🔗 arrith dnova: i was looking for a page for that
10:24 🔗 arrith didn't find one
10:25 🔗 ersi should be linked on the login page
10:26 🔗 arrith ersi: you sure it's there? might just be me but i'm not seeing anything. just "Create an account."
10:29 🔗 ersi hmm
10:30 🔗 ersi huh, weird. alright.. there's no special page for that
10:30 🔗 dnova really?
10:30 🔗 arrith well the signup doesn't have an email, usually resetting a password involves sending an email out
10:32 🔗 dnova welp.
11:00 🔗 kin37ik how did the crawls from yesterday turn out?
11:11 🔗 arrith well until further notice i am now arrith1 on the wiki
11:17 🔗 dnova what is python used for in the splinder download process?
12:15 🔗 emijrp testing script to download all wikkii.com wikis
12:15 🔗 emijrp WIKIFARMS ARE NOT TRUSTWORTHY.
12:17 🔗 chronomex k
12:17 🔗 arrith indeed
12:19 🔗 arrith btw if any Administrators get a chance, could one of them merge User:Arrith with User:Arrith1 please?
12:19 🔗 arrith oh wait, seems only SketchCow can do that
12:20 🔗 emijrp i guess no
12:20 🔗 emijrp what nick do you want?
12:20 🔗 emijrp i mean, sysop can merge pages
12:21 🔗 emijrp user accounts i think it is impossible
12:22 🔗 dnova if someone has a spare moment, could they put "Uploaded; still downloading more" next to my name in the splinder status table on the wiki?
12:22 🔗 arrith emijrp: http://www.mediawiki.org/wiki/Extension:User_Merge_and_Delete says "merge (refer contributions, texts, watchlists) of a first account A to a second account B"
12:22 🔗 emijrp mm, looks like merging accounts is possible
12:22 🔗 arrith which would be good enough for me
12:22 🔗 emijrp but are extensions, which need to be installed separately
12:23 🔗 emijrp not sure if jason is going to install it only for you
12:23 🔗 emijrp : )))
12:23 🔗 arrith emijrp: http://archiveteam.org/index.php?title=Special:Version says it's already installed :P
12:24 🔗 arrith although i am still curious why i can't get in on my original acct
12:24 🔗 emijrp ah man, he used it to merge spam users, i forgot
12:24 🔗 arrith dnova: done
12:24 🔗 dnova thanks
12:25 🔗 arrith yeah on that topic, there are a lot of pages with {{delete}}
12:25 🔗 arrith and by a lot i mean more than 5 heh
12:27 🔗 dnova we still need splinder downloaders
12:28 🔗 emijrp why does my userpage on the AT wiki have 8000 pageviews?
12:28 🔗 arrith emijrp: it's a very nice page
12:28 🔗 arrith btw, the time stamp on the latest archiveteam.org wiki dump at http://www.archiveteam.org/dumps/ is 15-Mar-2009 :|
12:28 🔗 arrith hardly "weekly"
12:31 🔗 emijrp http://code.google.com/p/wikiteam/downloads/list?can=2&q=archiveteam
12:31 🔗 emijrp i do them weekly, but i dont upload them
12:31 🔗 arrith aha, didn't think of wikiteam but now that i think about it that makes sense
12:31 🔗 arrith emijrp: images though?
12:31 🔗 emijrp i dont upload images to googlecode
12:32 🔗 emijrp only 4gb of hosting
12:32 🔗 emijrp http://www.archive.org/search.php?query=wikiteam%20archiveteam
12:32 🔗 arrith hm. i'm not sure where a good host would be. my first thought is archive.org
12:33 🔗 arrith hmm that's good, but are those maintained? they seem to be from August and July
12:33 🔗 emijrp http://www.referata.com/ another unstable wikifarm
12:33 🔗 dnova what is a wikifarm
12:34 🔗 emijrp free hosting for wikis
12:34 🔗 dnova oh.
12:36 🔗 emijrp referata is for semantic wikis
12:36 🔗 emijrp cool stuff
14:06 🔗 dnova downloading it:pornoromantica
14:16 🔗 emijrp wtf is that
14:19 🔗 dnova I sure do not know
14:45 🔗 rude___ SketchCow have you seen this filter for eliminating aliasing in 5Dmk2 video? http://www.mosaicengineering.com/products/vaf-5d2.html
15:02 🔗 underscor rude___: That's rad!
15:06 🔗 rude___ yup, funny that I've never noticed aliasing that bad in 5Dmk2 video before seeing the demos
15:06 🔗 underscor http://www.m0ar.org/6346
15:06 🔗 underscor This is amazing
15:06 🔗 underscor (it's not actually porn)
15:11 🔗 dnova pornoromantica is up to 29mb.
15:16 🔗 emijrp wtf is that
15:17 🔗 dnova someone's splinder account
15:17 🔗 dnova up to 32mb now
15:20 🔗 Ymgve underscor: how long do I have to watch before it gets amazing
15:21 🔗 underscor like 2 mins
15:21 🔗 underscor right after she says "I think he wants to fuck me"
15:21 🔗 Ymgve ah, there
15:21 🔗 emijrp and... ?
15:23 🔗 dnova emijrp: ?
15:24 🔗 emijrp she says that and what happens?
15:24 🔗 dnova oh. I have no idea.
15:24 🔗 emijrp I'm an archivist.
15:24 🔗 emijrp I'm confused.
15:24 🔗 dnova he archives the shit out of her
15:24 🔗 emijrp WHAT.
15:27 🔗 dnova emijrp, download some splinder
15:27 🔗 dnova these last bunch of profiles are a real bitch
15:27 🔗 emijrp If I download splinder, who the hell is going to download wikis?
15:27 🔗 underscor not really
15:28 🔗 underscor you have to watch it!
15:28 🔗 underscor I don't want to spoil it
15:35 🔗 Schbirid i downloaded ~14 gb of splinder, some might be unfinished. cant continue. where to put it?
15:36 🔗 dnova ask SketchCow for a slot and then use the upload-dld.sh script
15:36 🔗 dnova upload-finished.sh rather
15:44 🔗 Schbirid ok
15:44 🔗 Schbirid SketchCow: i need a place to upload splinder downloads
17:38 🔗 SketchCow You got it.
17:40 🔗 SketchCow Has anyone else in here gotten calls/contact from reporters wanting to do an article on archive team?
17:41 🔗 yipdw nope
17:41 🔗 dnova me either
17:44 🔗 SketchCow A fairly terrible article is going to come out, and I apologize in advance for it.
17:44 🔗 dnova how did you find out about it
17:44 🔗 SketchCow I did an interview for it.
17:45 🔗 SketchCow I didn't realize who was writing it, he used an intermediary who did not identify him as the author, after I repeatedly refused to interact with him.
17:45 🔗 SketchCow Now I found out and I have been yelling.
17:45 🔗 SketchCow I'm good at yelling.
17:45 🔗 underscor lol
17:46 🔗 yipdw oh, was it Talmudge and/or Schwartz
17:46 🔗 closure they're in your voicemail, archiving your archiving
17:48 🔗 SketchCow Yes
17:49 🔗 underscor yipdw: You know them?
17:49 🔗 yipdw sneaky bastards
17:49 🔗 yipdw not personally
17:49 🔗 yipdw I do remember seeing Mattattattattattattattattahias Schwartz's article on Internet trolling a while back, though
17:50 🔗 underscor hahaha
17:52 🔗 yipdw I figure if he wants to invoke the pumping lemma on his name, it's fair game
17:52 🔗 * closure listens to a 1 gb WD green sata drive fail to spin up in my external dock
17:53 🔗 underscor :(
17:53 🔗 closure wonder if it will do better on internal SATA.. will have to try later
17:54 🔗 closure huh, on one dock it does nothing, on the other I can hear the motor fail to quite spin it
17:55 🔗 SketchCow Or you're torturing it and it's randomly going up and down.
17:56 🔗 dnova us? torture hard drives?
17:56 🔗 dnova never!
17:57 🔗 SketchCow Schbirid: Need the slot!
18:01 🔗 closure oh good, everything on this drive is still present on some 50 or so dvds. urk.
18:01 🔗 dnova haha
18:29 🔗 dnova god damnit I just lost about 15 hours worth of downloading.
18:29 🔗 SketchCow See, that's what you get
18:29 🔗 SketchCow "ha ha you lost so much stOH FUCK I LOST MY STUFF"\
18:29 🔗 SketchCow Jesus did it to you
18:30 🔗 SketchCow Jesus, he likes insta-parables these days
18:30 🔗 dnova I wasn't laughing at closure!! well I kinda was.
18:31 🔗 dnova argh!
18:32 🔗 SketchCow Jesus knew
18:35 🔗 underscor hahaha
18:39 🔗 emijrp haha closure and dnova lost stuff
18:40 🔗 emijrp i lost 1.5TB some months ago, so, it cant get worse
18:42 🔗 emijrp obviously, unique stuff is currently being destroyed on the Internet; what is your estimate in megabytes?
18:43 🔗 emijrp mb/hour
18:46 🔗 underscor Thank you for placing your order with the Comprehensive Large-Array data Stewardship System.
18:52 🔗 dnova what's that?
18:57 🔗 emijrp I sure do not know
18:58 🔗 dnova haha
18:58 🔗 dnova I lost pornoromantico :(
18:58 🔗 dnova it was over 400mb
18:58 🔗 SketchCow It's the Big Brother program for fat people
19:00 🔗 underscor hahahah
19:00 🔗 underscor It's NOAA's tape access system
19:00 🔗 dnova you want noaa tapes?
19:00 🔗 underscor I need a piece of historical data for oceanography class
19:01 🔗 dnova awesome.
19:07 🔗 emijrp Get your piece of oceanographic data http://en.wikipedia.org/wiki/Exploding_whale
19:12 🔗 underscor I'm digging through noaa's various public FTP servers
19:12 🔗 underscor There's so much old cruft and stuff, it's really cool
19:12 🔗 underscor TODO: Ring bob and tell him to actually upload the data here
19:12 🔗 underscor This directory contains files related to the March 1993 Blizzard. It
19:12 🔗 underscor includes a report on the storm and related data files described in
19:12 🔗 underscor the report. NCDC's homepage provides easy access to this directory.
19:13 🔗 underscor That file was updated 8/15/97
19:13 🔗 underscor and the data is still not there
19:13 🔗 underscor hahaha
19:13 🔗 dnova damnit, bob
19:15 🔗 Schbirid damn, only 5 of 14gb were "finished" data
19:15 🔗 bsmith093 noaa has public ftp archives ?!
19:15 🔗 underscor ftp.ncdc.noaa.gov, ftp.ngdc.noaa.gov
19:16 🔗 underscor ftp.nodc.noaa.gov
19:18 🔗 emijrp If that FTP has been up since 1997, it is almost as trustworthy as Internet Archive.
19:25 🔗 bsmith093 anyone want to catch me up on how we're archiving ffnet
19:25 🔗 bsmith093 im also in #fanfriction
19:29 🔗 godane looks like podtrac doesn't keep the ipod format of crankygeeks after episode number 100
19:29 🔗 godane the mpeg4 is likely hosted by pcmag
19:30 🔗 bsmith093 are u grabbing all their feeds, ausio ogg, etc
19:30 🔗 bsmith093 *audio
19:30 🔗 godane no
19:30 🔗 godane mp3 is down
19:31 🔗 godane can you guys please start mirroring crankygeeks
19:31 🔗 godane i didn't think it was this bad yet
19:31 🔗 bsmith093 when i checked only 4 mp3s were dead
19:32 🔗 godane its only 100-103 that are down
19:32 🔗 godane ok
20:09 🔗 SketchCow Anyone have an idea for a sed that leaves nothing BUT A-Za-z0-9 ?
20:09 🔗 PatC What's a sed?
20:10 🔗 SketchCow Found it. sed 's/[^a-zA-Z0-9]//g'
20:12 🔗 godane looks like all links to 101-103 are dead
20:12 🔗 godane for crankygeeks
20:13 🔗 Schbirid SketchCow: alternatively grep -Eo '[a-zA-Z0-9]' might do what you want
20:23 🔗 SketchCow I would move ./Amiga Dream 01 - Nov 1993 - Page 32.jpg to AmigaDream01-Nov1993-Page32.jpg
20:23 🔗 SketchCow I would move ./Amiga Dream 01 - Nov 1993 - Page 67.jpg to AmigaDream01-Nov1993-Page67.jpg
20:23 🔗 SketchCow I would move ./Amiga Dream 01 - Nov 1993 - Page 22.jpg to AmigaDream01-Nov1993-Page22.jpg
20:23 🔗 SketchCow I would move ./Amiga Dream 01 - Nov 1993 - Page 15.jpg to AmigaDream01-Nov1993-Page15.jpg
20:23 🔗 SketchCow I would move ./Amiga Dream 01 - Nov 1993 - Page 45.jpg to AmigaDream01-Nov1993-Page45.jpg
20:23 🔗 SketchCow Tah dah.
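(A minimal sketch of a dry-run renamer that prints lines like the above — the "strip spaces, keep dashes and dots" pattern is inferred from the sample output, and the actual mv is left commented out:)

  # preview renames: drop spaces from jpg names, leave the rest intact
  find . -name '*.jpg' | while read -r f; do
    new=$(basename "$f" | sed 's/ //g')
    echo "I would move $f to $new"
    # mv "$f" "$new"   # uncomment once the preview looks right
  done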
20:25 🔗 emijrp SketchCow: did you read my suggestion for human ocr at IA ?
20:25 🔗 SketchCow That you think the non-human OCR sucks and it should be replaced?
20:25 🔗 SketchCow Was there more to it?
20:26 🔗 emijrp replaced with a collaborative ocr, the technology for that exists
20:26 🔗 SketchCow I see.
20:27 🔗 emijrp i think i read about IA saying their books have OCR for blind people
20:27 🔗 emijrp but... checking the .txt files on scanned books, O_O
20:32 🔗 emijrp IA has great potential, but most of its content is in useless formats
20:33 🔗 emijrp I dont want to read a book in JPG/DJVU, I want an epub or a correct txt
20:33 🔗 SketchCow You strike at the heart of an endemic issue with IA.
20:34 🔗 SketchCow I need to do a quick errand.
20:34 🔗 SketchCow But it's a big issue and there may not be an easy solution.
20:34 🔗 SketchCow It's a political technical issue
20:46 🔗 emijrp From Wikipedia: Metapedia is a white nationalist and white supremacist,[2] extreme right-wing and multilingual online encyclopedia.[3][4][5]
20:46 🔗 emijrp And it is a wiki.
20:46 🔗 emijrp What is your opinion about archiving that?
20:46 🔗 emijrp Discuss.
20:47 🔗 dnova archive it
20:48 🔗 Schbirid archive it
20:54 🔗 emijrp Is it illegal to upload that into IA?
20:55 🔗 emijrp I think it depends on servers location.
20:55 🔗 dnova why would it be illegal?
20:56 🔗 emijrp Speaking in positive tone about nazism is illegal in some jurisdictions.
21:21 🔗 SketchCow Download it.
21:21 🔗 SketchCow archive it.
21:51 🔗 arkhive http://arstechnica.com/gaming/news/2011/11/gamepro-magazine-and-website-to-shutter-next-month-1.ars
21:51 🔗 chronomex emijrp: there is no law against naziism in the United States of America
21:52 🔗 arkhive December 5th
21:52 🔗 chronomex "Congress shall make no law ... abridging the freedom of speech, or of the press ..."
22:44 🔗 godane i'm using wget-warc to backup crankygeeks web site
23:07 🔗 bsmith093 for the ffnet archiving effort, where's the list of good id#s?
23:08 🔗 arkhive Will archiveteam backup GamePro, Waves on Google Wave or Knol?
23:08 🔗 arkhive And did you guys backup Aardvark?
23:09 🔗 godane is fanfiction.net going down?
23:11 🔗 bsmith093 godane: no, it's pre-emptive
23:11 🔗 godane ok
23:13 🔗 chronomex arkhive: I've heard some buzz about Knol. I haven't heard much about Wave. Are you interested in starting a subcommittee?
23:13 🔗 godane can wget-warc sed out the main website?
23:13 🔗 bsmith093 id like to help with knol, if theres a script, cause the button option sounds reallyreally slow
23:13 🔗 godane i want it to work locally and be possible to host it locally
23:15 🔗 chronomex godane: the purpose of WARC is to preserve as close to the source material as possible. as such, altering a page before storing it into the WARC file is to be avoided. wget can --convert-links, but I don't know how this affects the .warc output.
23:15 🔗 alard chronomex: --convert-links is safe to use with warc.
23:16 🔗 emijrp arkhive: we have the metadata for knols, 700,000, now a scraper for the real content is needed
23:17 🔗 arkhive chronomex: I'd like to help back Knol up.
23:19 🔗 arkhive emijrp: can we write a script and have a server tell each connected client what knol to fetch. Like we did for backing up Google Video's Videos?
23:20 🔗 arkhive I'm not too good on writing scripts though.
23:22 🔗 emijrp im not sure, i dont know how people make those cool distributed scripts
23:22 🔗 chronomex alard: excellent.
23:23 🔗 godane wget-warc is not converting links
23:23 🔗 bsmith095 since everyone seems to be in here anyway, wheres the list of good id#s for ffnet
23:23 🔗 godane i still get http://www.crankygeeks.com/favicon.ico
23:23 🔗 godane like links
23:23 🔗 bsmith095 -k
23:23 🔗 godane i did
23:23 🔗 godane index.html download
23:24 🔗 godane and i still get those links
23:24 🔗 bsmith095 wget-warc -mcpk
23:25 🔗 emijrp arkhive: channel is #klol
23:27 🔗 godane i'm still getting http://www.crankygeeks.com/favicon.ico links
23:27 🔗 godane i also don't want it redownloading everything
23:28 🔗 godane wget "http://www.crankygeeks.com/" --warc-file="crankygeeks" --no-warc-compression -mcpk
23:28 🔗 godane thats what i used
23:28 🔗 godane my wget-warc is wget
23:29 🔗 arkhive GamePro's down December 5th...We should also start that. I've got storage and computers I can dedicate to backing it up.
23:30 🔗 dnova how big is gampero
23:30 🔗 dnova gamepro
23:30 🔗 dnova and what is it
23:31 🔗 arkhive A gaming blog, news, magazine website
23:31 🔗 arkhive Not sure How big Gamepro's site is. Not sure how to check either.
23:31 🔗 emijrp delicious was archived?
23:32 🔗 alard godane: Be aware that compressing warcs is preferably done while wget is downloading, not with a post-processing gzip step. So if you do intend to gzip later, it's better to remove the --no-warc-compression
23:33 🔗 godane alard: i want to make sure i don't get the same problem i have now
23:33 🔗 godane if i compress i will not know if it did it correctly
23:34 🔗 alard You can gunzip?
23:34 🔗 godane not until i'm don't download
23:34 🔗 godane *done
23:34 🔗 yipdw hmm
23:34 🔗 yipdw guess I'll snag a copy of http://wikileaks.org/the-spyfiles.html
23:35 🔗 godane i'm not downloading a 200mb website to find out it didn't convert the links
23:37 🔗 emijrp yipdw: thanks, now this channel is being monitored by CIA, NSA and ETs.
23:37 🔗 bsmith095 ET? really, cool :)
23:37 🔗 yipdw emijrp: bomb, bin Laden, airplane
23:37 🔗 alard godane: You should do what you think is best, of course, but: 1. you can gunzip the warc.gz while downloading (it'll just print an error at the end, but you will see what's been downloaded so far) 2. bear in mind that the wget link conversion is always done at the end of the download, if I remember correctly.
23:37 🔗 balrog_ oh
23:37 🔗 bsmith095 correct
23:37 🔗 balrog_ anyone know of tools for scraping MediaWiki sites?
23:38 🔗 alard godane: just do gunzip -c my.warc.gz | less
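(Putting alard's two points together, the earlier crankygeeks grab would look something like this sketch, with wget gzipping the WARC itself during the crawl:)

  # compressed WARC is written while the crawl runs; no post-processing gzip
  wget "http://www.crankygeeks.com/" --warc-file="crankygeeks" -mcpk
  # peek at the in-progress warc.gz: gunzip complains about the truncated end,
  # but everything fetched so far is readable
  gunzip -c crankygeeks.warc.gz | less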
23:38 🔗 emijrp balrog: I call them WikiTeam tools.
23:38 🔗 chronomex balrog: wikiteam produced some good tools. look in the archiveteam wiki
23:39 🔗 godane alard: i may never finish the download cause it keeps downloading blank previous pages
23:39 🔗 balrog awesome. thanks!
23:40 🔗 godane just to go back to the last 15 episodes it's at over 1500, even when there are only 237 episodes
23:40 🔗 emijrp balrog: which wiki?
23:40 🔗 balrog none that Archive Team would be interested in. it's not going away but I need a backup for various purposes
23:40 🔗 balrog ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated
23:40 🔗 balrog any way to get around that?
23:42 🔗 emijrp only affects pages with 1000+ revisions
23:42 🔗 emijrp not the common case
23:42 🔗 balrog ahh, yeah shouldn't have any of those here at all.
23:42 🔗 chronomex balrog: I think you'll have to upgrade the wiki to fix that, but it's rare to cause problems.
23:42 🔗 balrog :)
23:42 🔗 balrog it's not a wiki I have access to.
23:43 🔗 balrog but yeah shouldn't have +1000-rev pages.
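(That truncation warning reads like dumpgenerator.py output; for reference, a typical WikiTeam invocation looks something like this — the wiki URL is hypothetical:)

  # fetch full page histories as XML plus all images via the MediaWiki API
  python dumpgenerator.py --api=http://wiki.example.com/api.php --xml --images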
23:43 🔗 bsmith095 underscor: do u have the list of valid ids for ffnet
23:43 🔗 godane i really HATE mirror websites now
23:44 🔗 godane they just always keep remirroring the full site when i just want the updated crap
23:44 🔗 godane like xkcd.com
23:44 🔗 godane httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo -n --disable-security-limits -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 --update xkcd.com
23:45 🔗 godane thats what i did
23:45 🔗 bsmith095 i have an xkcd images script if anyone wants that
23:45 🔗 godane i thought --update would not fucking redownload files
23:46 🔗 bsmith095 why backup xkcd?
23:46 🔗 bsmith095 the whole thing?
23:46 🔗 godane again
23:46 🔗 godane i want to host things locally on my local lan
23:46 🔗 bsmith095 wget-warc -mcpk --random-wait xkcd.com
23:47 🔗 godane but that will not download imgs.xkcd.com i think
23:47 🔗 emijrp http://arxiv.org/ is a great site but they have counter-archivist measures
23:52 🔗 godane looks like stupid imgs.xkcd.com can't be mirrored with wget-warc
23:53 🔗 bsmith093 if u want imgs.xkcd.com, google this: yaxkcdds.sh
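(godane's guess is right that -m alone stays on the starting host; one way to pull in imgs.xkcd.com as well is host-spanning — a sketch, flag combination assumed rather than tested against xkcd:)

  # -H allows spanning to other hosts, -D limits the span to *.xkcd.com,
  # so page requisites served from imgs.xkcd.com get fetched too
  wget -mcpk --random-wait -H -Dxkcd.com --warc-file="xkcd" http://xkcd.com/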
