00:08 <Paradoks> Synchronet is still being developed. I re-learned this in the past few days while reminiscing about my old BBS software ( http://tech.groups.yahoo.com/group/KBBS/ )
00:08 <Paradoks> Err, clients. Whoops.
00:09 <Paradoks> I figured people just used telnet.
00:14 <Coderjoe> if net connected, usually just telnet or the like
00:14 <Coderjoe> (unless you want to connect to a system with RIP or some custom graphical client program)
00:15 <Coderjoe> (mmmm RIPTERM... something I only used twice or so)
00:20 <bsmith093> still need phone #s
00:20 <bsmith093> \??
00:21 <Coderjoe> no idea where to get current numbers
00:24 <bsmith093> are there current numbers?
00:24 <bsmith093> how would it work with telnet
00:24 <Coderjoe> you need to know a hostname or ip address to telnet to
00:25 <Coderjoe> miku.acm.uiuc.edu
00:27 <DFJustin> you can also use dosbox to redirect telnet to an emulated COM port and use old DOS terminal software
00:28 <Coderjoe> hahahah
00:29 <Coderjoe> "taking a moon lander out to do riceboy drifting out in the parking lot"
00:30 <DFJustin> oh it's even easier than I thought in dosbox, you just dial an IP address instead of a phone number
00:32 <SketchCow> Thank you, rude___
00:37 <bsmith093> Coderjoe: haha very funny
00:39 <bsmith093> nyancat or whatever it's called
00:41 <bsmith093> although nice trick, i didn't even know the terminal could receive color
00:41 <Coderjoe> xterm-color, ANSI, etc
00:42 <bsmith093> SketchCow: anything besides the insanely huge mobileme to archive, something smaller? perhaps fanfiction.net? would love to help with that :) tried wget -mcpk, didn't get all of it though, weird
00:44 <bsmith093> even with the useragent workaround, stopped at 300K files, and i know there are at least 2 million stories
00:44 <db48x> Coderjoe: did you see that episode of Top Gear where they actually drove the real moon buggy?
00:44 <Coderjoe> no.
00:45 <Coderjoe> i want to now
00:45 <db48x> I recommend it :)
00:45 <db48x> it's the one they developed for a future moon mission, that may or may not ever be used
00:45 <db48x> it's got a pressurized cabin
00:45 <db48x> 6-wheel drive
00:45 <bsmith093> there's another moon mission?!
00:46 <db48x> full independent suspension
00:46 <bsmith093> i want one !
00:46 <bsmith093> coolest...moon buggy...ever!!
00:46 <db48x> each wheel is independently steerable
00:47 <db48x> the console inside gives you diagnostics on each wheel, showing you how much power is being applied and so on
00:47 <bsmith093> wait... why independently?
00:47 <bsmith093> wouldn't you usually be pointing them all in roughly the same direction at any given time?
00:47 <db48x> the wheels might not all be touching the ground at the same time
00:47 <db48x> they have a lot of vertical travel so that you can go over rocks
00:47 <db48x> yea, in general
00:47 <bsmith093> oh.. yeah... duh moon grav.. wow i feel stupid
00:48 <db48x> but you might want to spin in place
00:48 <bsmith093> like a donut, or spin actually in place
00:48 <Coderjoe> someone hasn't seen things like zero-point-turn commercial mowers and stuff
00:48 <Coderjoe> (though those go by a different means, like tanks)
00:48 <bsmith093> ZERO POINT TURN?!?? for a LAWN MOWER!?! why?
00:49 <db48x> yea
00:49 <db48x> heh
00:49 <Coderjoe> commercial mowers. so they can mow a field faster and get more jobs in during a day
00:49 <bsmith093> that's like, i have a sleep disorder, oh here's a TIME MACHINE!
00:49 <db48x> lol, great reference
00:49 <bsmith093> oh well commercial mowers, well that actually makes sense
00:49 <db48x> hmm, think chapter 78 is up yet?
00:50 <Coderjoe> more jobs means they can get more income
00:51 <bsmith093> yeah, hpmor rocks
00:51 <db48x> http://www.topgear.com/uk/photos/topgear-moon-drive?imageNo=1
00:52 <bsmith093> here's all of it in one convenient file
00:52 <db48x> I have it on my phone as an ebook too
00:52 <bsmith093> neat
00:52 <Coderjoe> eh, what is this?
00:52 <bsmith093> vague much
00:53 <db48x> Coderjoe: Harry Potter and the Methods of Rationality?
00:53 <db48x> oh, my bad. it's 12-wheel drive
00:53 <db48x> 6 pairs of wheels
00:54 <Coderjoe> no, i meant the "hpmor"
00:54 <db48x> yea, HPMoR, Harry Potter and the Methods of Rationality
00:54 <bsmith093> ACRONYMS ARE U=YOUR FRIEND
00:54 <bsmith093> stupid caps
00:57 <db48x> Coderjoe: http://www.fanfiction.net/s/5782108/1/Harry_Potter_and_the_Methods_of_Rationality
01:00 <db48x> Coderjoe: I recommend it even if you don't generally like fan fiction, but be forewarned: your laughter will annoy your housemates/coworkers
01:01 <bsmith093> every story has its own unique id number, they are apparently sequential, hey come to think of it i have a fanfiction downloader that takes link lists
01:02 <bsmith093> i'll just generate all possible story ids and plug them into that
01:03 <db48x> :)
01:03 <db48x> integrate it with the scripts we used for splinder/mobileme/anyhub
01:07 <db48x> then we can all help out
01:07 <bsmith093> not really sure how, what im doing (or trying to do) will just grab all the stories, (hopefully) check link by link and download into a text file by category, author and storyname, using this little binary blob linux app i found here, fanfictiondownloader.net
01:12 <bsmith093> ok well my generator is choking, trying to pump out 10 million links at once, so basically what's the command to generate these http://www.fanfiction.net/s/[0000000-9999999]/1/ note the regex im trying to use
01:12 <bsmith093> 0 to 9 999 999
01:13 <bsmith093> probably nowhere near that many stories but the id's are all over the place
01:15 <Coderjoe> wow. I never knew there was a printer accessory for the game boy
01:15 <db48x> heh
01:15 <Coderjoe> (and I had a 1st gen game boy)
01:18 <db48x> there was a printer available for my favorite calculator
01:20 <arrith> bsmith093: this is one way, but it takes forever: for i in {0000000..9999999}; do echo $i >> file.txt; done
01:21 <arrith> actually
01:22 <arrith> bsmith093: echo {0000000..9999999} | tr ' ' '\n' > file.txt
01:22 <arrith> should be a lot faster
01:24 <bsmith093> ok but with the links around the numbers
01:24 <arrith> yeah. took a little less than a minute just now. generated a 70 megabyte file
01:24 <arrith> oh
01:24 <bsmith093> sorry this thing's very picky that way
01:24 <arrith> yeah sure
01:25 <bsmith093> i can run it from here though :) once i have the command, or if i can ever figure out regex like this for myself :)
01:26 <arrith> bsmith093: i dunno if generating the numbers beforehand is faster but, this is what you'd run after that last one (echo | tr thing): while read $num; do echo http://www.fanfiction.net/s/$num/1/ > linklist.txt; done < file.txt
01:26 <bsmith093> i swear every script i've ever seen that's more complicated than "wget here, grab this, put it there" looks like chinese to me
01:26 <arrith> oh. well. there's that if you want/need it for inspiration
01:26 <arrith> yeah bash on a single line isn't too friendly
01:26 <bsmith093> thanks
01:27 <Coderjoe> keep in mind that /1/ needs to be incremented as well until you run out of chapters
01:27 <arrith> oh, if so then the linklist would just have chapter 1 for everything
01:28 <arrith> how many chapters should we look for for each?
01:28 <arrith> er
01:29 <arrith> bsmith093: the command should actually have "> linklist.txt" at the end: while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt
01:29 <Coderjoe> number of chapters depends on the story
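Coderjoe's point is the crux of the chapter problem: /s/ID/1/ is only the first chapter, and the count varies per story. A minimal bash sketch of that walk; the "Chapter not found" marker is an assumption for illustration, not a verified ff.net response (the site turns out below to answer 200 either way):

    id=5192986
    ch=1
    while page=$(curl -s "http://www.fanfiction.net/s/$id/$ch/"); do
      # assumed error marker; adjust to whatever the real error page says
      echo "$page" | grep -q "Chapter not found" && break
      echo "$page" > "story_${id}_ch${ch}.html"
      ch=$((ch + 1))
    done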
01:29
🔗
|
bsmith093 |
on my end its still generating the file of numbers |
01:30
🔗
|
arrith |
bsmith093: the "echo {0000000..9999999} | tr ' ' '\n' > file.txt" ? |
01:30
🔗
|
bsmith093 |
yes |
01:30
🔗
|
arrith |
ah, ok |
01:30
🔗
|
Coderjoe |
holy crap. VASTLY different youtube front page, too |
01:30
🔗
|
arrith |
file should be around 77MB so you can track its progress looking at that |
01:30
🔗
|
arrith |
bsmith093 |
01:31
🔗
|
arrith |
Coderjoe: yeah but similar to how some ids will not go to real stories, one must kind of pick a default for how many chapters to look for. stopping trying to download more than 4 if say chapter 4 isn't found would be good, but that requires tool support |
01:32
🔗
|
rude___ |
no prob SketchCow.. got more coming your way soon |
01:32
🔗
|
godane |
did comcast start speeding up download speeds? |
01:32
🔗
|
godane |
i only ask cause i have 800kbytes down |
01:33
🔗
|
arrith |
googling found this http://bashscripts.org/forum/viewtopic.php?f=8&t=1081 |
01:33
🔗
|
arrith |
godane: in some areas they increase the dl speed. people usually get an email |
01:33
🔗
|
bsmith093 |
the thing i have at fanfictiondownloader will auto download if it has more than one chapter, i just have to find out if it wwill continure upon find ing an invalid is |
01:34
🔗
|
arrith |
oh |
01:34
🔗
|
bsmith093 |
is sorry this is really slowing down my laptop |
01:34
🔗
|
bsmith093 |
*id* oy vey typos |
01:34
🔗
|
arrith |
bsmith093: yeah. maxed out my computer for a little bit. you can renice it and it'll go slower but not take over so much |
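Something like this would do the renice arrith suggests (19 is the lowest priority); pgrep -f and the downloader.py process name are assumptions about what is actually running:

    renice -n 19 -p $(pgrep -f downloader.py)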
01:37
🔗
|
Coderjoe |
new frontpage: http://i.imgur.com/RPw6K.png |
01:38
🔗
|
PatC |
Coderjoe, yep :/ |
01:38
🔗
|
arrith |
wow |
01:39
🔗
|
arrith |
that's quite a change |
01:41
🔗
|
bsmith093 |
how do i track a files changes in realtime cli |
01:42
🔗
|
bsmith093 |
when another process is editing it |
01:45
🔗
|
arrith |
bsmith093: "tail -f file.txt" will output lines getting added to a file. but what i'd do if i were you is "watch ls -lh" in the directory you're generating the txt |
01:45
🔗
|
arrith |
watch reruns a command, by default every 2 seconds, so you can see how big it's getting |
01:46
🔗
|
arrith |
tail -f might slow it down is why i say ls over tail |
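The two commands arrith is comparing, spelled out; -n 2 is watch's default interval anyway:

    watch -n 2 ls -lh linklist.txt   # re-run ls every 2 seconds to see the size grow
    tail -f linklist.txt             # or stream the lines as they are appended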
01:48 <bsmith093> its at 270MB and rising
01:48 <bsmith093> linklist
01:48 <arrith> er
01:48 <arrith> oh
01:59 <arrith> i thought for a sec you meant file.txt, heh that'd be way too big
01:59 <arrith> btw seems the overall count is 10^7 - 1
02:00 <arrith> for ease of notation
02:00 <bsmith093> that finally completed linklist is full of these http://www.fanfiction.net/s//1/
02:01 <bsmith093> inbetween the double slashes is where the id goes
02:01 <bsmith093> sorry minor glitch there, and i cant see why
02:01 <bsmith093> stopped growing and completed at 306mb
02:01 <arrith> hmmm
02:01 <arrith> bsmith093: are you on ubuntu or osx?
02:02 <bsmith093> ubuntu
02:03 <bsmith093> lucid 10.04 32bit
02:03 <bsmith093> where was i supposed to run the while read $num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt cause i just ran it in the terminal, like the generator command earlier
02:04 <Coderjoe> "I can't believe my dress ripped. They saw everything! Even my Ranma panties. They change color when I get wet."
02:04 <arrith> bsmith093: ah yeah i'm getting the same result. sorry about that
02:05 <bsmith093> wait how does it know where $num is?
02:05 <arrith> bsmith093: the "while read $num" is supposed to operate on each line of the file piped in, which is "< file.txt"
02:07 <bsmith093> oh, see this is why im going to be taking a sed and bash scripting class in college
02:07 <arrith> errr
02:07 <arrith> bsmith093: "while read num" not "while read $num"
02:07 <bsmith093> um ok then re running now
02:07 <arrith> while read num; do echo http://www.fanfiction.net/s/$num/1/; done < file.txt > linklist.txt
02:08 <arrith> all the same except for that first part
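The corrected loop, unpacked: read names the variable without a $, which is why the broken version left $num empty and produced the /s//1/ links, and keeping the redirect after done writes one continuous stream instead of truncating linklist.txt on every iteration:

    while read num; do
      echo "http://www.fanfiction.net/s/$num/1/"
    done < file.txt > linklist.txt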
02:08
🔗
|
arrith |
sorry about that |
02:09
🔗
|
bsmith093 |
running perfectly now just gotta wait again |
02:09
🔗
|
bsmith093 |
meantime lets see what my actual downloader will do to an invalid id |
02:12
🔗
|
arrith |
bsmith093: good idea |
02:13
🔗
|
arrith |
bsmith093: btw by my calculations the resulting file should be 80000000 + (10^7-1) * 31 bytes or 371.932954 megabytes (according to google) |
02:14
🔗
|
arrith |
http://www.google.com/search?q=(80000000+%2B+(10^7-1)+*+31)+bytes+to+megabytes |
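The same estimate, worked through per line: each output line is the 28-byte prefix http://www.fanfiction.net/s/ plus 7 digits, /1/, and a newline:

    echo $(( (28 + 7 + 3 + 1) * 10**7 ))   # 390000000 bytes, i.e. ~372 MB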
02:16
🔗
|
bsmith093 |
well it works great if the link is valid, other wise it dies |
02:17
🔗
|
bsmith093 |
althought there is always the scripts its based on.. hold o0n a min |
02:19
🔗
|
bsmith093 |
here grab this http://fanficdownloader.googlecode.com/files/fanficdownloader-4.0.6.zip |
02:20
🔗
|
arrith |
ahh python |
02:22
🔗
|
bsmith093 |
ok now go in and take a look at downloader.py apparently the only thing it CANT do is read links from a file |
02:23
🔗
|
arrith |
heh |
02:23
🔗
|
bsmith093 |
is there a pipe for that? |
02:23
🔗
|
arrith |
welll if i knew python it shouldn't be too hard to add that functionality |
02:23
🔗
|
arrith |
oh |
02:23
🔗
|
arrith |
i'd do a bash "while read" |
02:24
🔗
|
Coderjoe |
http://boingboing.net/2011/12/01/your-tax-dollars-at-work-misl.html |
02:24
🔗
|
arrith |
bsmith093: while read link; do python downloader.py $link; done < linklist.txt |
02:24
🔗
|
arrith |
er |
02:24
🔗
|
arrith |
but if you want .html you have to specify that manually it says |
02:25
🔗
|
arrith |
so |
02:25
🔗
|
arrith |
bsmith093: while read link; do python downloader.py -f html $link; done < linklist.txt |
02:25
🔗
|
Coderjoe |
bah |
02:25
🔗
|
arrith |
i'd try with 3-5 links before running it on linklist |
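One way to do that dry run on just the first few links, assuming downloader.py handles one URL per invocation as shown above:

    head -5 linklist.txt | while read link; do
      python downloader.py -f html "$link"
    done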
02:25
🔗
|
Coderjoe |
python is simple |
02:26
🔗
|
arrith |
Coderjoe: not for someone with a huge mental block against learning things in one sitting. i've been meaning to learn it for like a years now. ;( |
02:26
🔗
|
Coderjoe |
what programming languages do you know? |
02:26
🔗
|
bsmith093 |
yeah apparently this script was heavily modified to make the binary blob i found, but he did say that, so,... anyway this one wants full urls, not just nice sequential ids |
02:27
🔗
|
Coderjoe |
(and a real programmer should be able to figure out other languages of the same type fairly easily) |
02:27
🔗
|
bsmith093 |
fanficdownloader.net is there any way to see inside a linux blob |
02:28
🔗
|
Coderjoe |
the source is all in the zip file |
02:29
🔗
|
arrith |
Coderjoe: bash and ti basic |
02:30
🔗
|
arrith |
fanficdownloader.net isn't loading for me |
02:30
🔗
|
bsmith093 |
yes but i cant really read python so if uve got it go here fanficdownloader-4.0.6/fanficdownloader/adapters/adapter_fanfictionnet.py |
02:30
🔗
|
arrith |
bsmith093: what's the difference between the links in linklist and "full urls"? |
02:30
🔗
|
Coderjoe |
default format of the fanficdownloader python in a zip file is epub |
02:30
🔗
|
bsmith093 |
fanfictiondownloader.net |
02:31
🔗
|
bsmith093 |
not fanfic |
02:31
🔗
|
Coderjoe |
it can also do html or txt |
02:31
🔗
|
arrith |
oh, help for it seems to say just epub or html |
02:31
🔗
|
arrith |
derp, nvm. "text or html" |
02:32
🔗
|
Coderjoe |
though I would prefer to call out to wget-warc or something else that packs a warc |
02:32
🔗
|
bsmith093 |
linklist has these http://www.fanfiction.net/s/5192986 |
02:32
🔗
|
bsmith093 |
the script currently wants these http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo |
02:32
🔗
|
bsmith093 |
even though it splices out the story id anyway?! |
02:33
🔗
|
arrith |
yeah it shouldn't need those |
02:33
🔗
|
arrith |
bsmith093: wait so it complains if you don't put in the story id? |
02:33
🔗
|
bsmith093 |
no it complains if you leave off the title like this http://www.fanfiction.net/s/5192986/1/A_Fox_in_Tokyo |
02:34
🔗
|
bsmith093 |
the thing after the id it wants that |
02:34
🔗
|
bsmith093 |
which is arbitrary, and not at all sequential |
02:34
🔗
|
arrith |
bsmith093: try putting a placeholder thing there. like "foo" |
02:34
🔗
|
arrith |
if it just strips it |
02:35
🔗
|
bsmith093 |
npe chokes |
02:35
🔗
|
bsmith093 |
it reads the full url and loads it |
02:35
🔗
|
bsmith093 |
so that wont work |
02:36
🔗
|
bsmith093 |
fanficdownloader-4.0.6/fanficdownloader/adapters/base_adapter.py", line 166, in _fetchUrl |
02:36
🔗
|
bsmith093 |
raise(excpt) |
02:38
🔗
|
arrith |
hmm |
02:46
🔗
|
arrith |
bsmith093: do you want epub or html? |
02:46
🔗
|
bsmith093 |
i suppose for future proffinging purposes not to mention formatting html would be best |
02:47
🔗
|
arrith |
ah |
02:47
🔗
|
arrith |
well |
02:47
🔗
|
arrith |
i'm not getting the error you're getting for some reason |
02:48
🔗
|
arrith |
all of these link formats work for me with fanficdownloader-4.0.6: http://www.fanfiction.net/s/5192986/ http://www.fanfiction.net/s/5192986/1/ http://www.fanfiction.net/s/5192986/1/A |
02:49
🔗
|
bsmith093 |
hmmm... |
02:49
🔗
|
arrith |
i'm doing this basically python /home/arrith/bin/fanficdownloader-4.0.6/downloader.py -f html http://www.fanfiction.net/s/5192986/ |
02:50
🔗
|
arrith |
bsmith093: pastebin all the output downloader.py gives you |
02:50
🔗
|
arrith |
says stuff like this for me "DEBUG:downloader.py(93):reading [] config file(s), if present" |
02:54
🔗
|
arrith |
Coderjoe: have you seen anyone using MHT stuff? or is it not that good compared to WARC? (as in this: https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/ ) |
02:56
🔗
|
bsmith093 |
arrith: here http://pastebin.com/XhecfW5M |
02:57
🔗
|
arrith |
bsmith093: hmm that's pretty odd. at first glance it almost looks like fanfiction.net blocked you |
02:58
🔗
|
bsmith093 |
i knew this would be more complicated than just a simple mirror |
02:59
🔗
|
arrith |
bsmith093: can you go to the url for the fanfiction in a browser or with wget or curl? |
02:59
🔗
|
bsmith093 |
just thinking of that hold on |
03:00
🔗
|
bsmith093 |
yes but it saves the page as index.html, and with all the iamges and other things |
03:00
🔗
|
arrith |
yeah |
03:00
🔗
|
arrith |
huh, but it lets you. interesting |
03:01
🔗
|
bsmith093 |
but we could run the linkost throught wget and stripout the 404s right |
03:01
🔗
|
bsmith093 |
but would they even be 404?, damn this is hard |
03:02
🔗
|
arrith |
a good site would have them as 404s. but yeah, wget has a good option for that: --spider |
03:03
🔗
|
arrith |
that's a good idea since wget i think would be much faster than this python script |
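What --spider does here: fetch the URL without saving anything, with wget's exit status signalling success or failure. A sketch; note the channel finds just below that ff.net answers 200 even for missing stories, so the exit status alone cannot weed this particular list:

    wget -q --spider "http://www.fanfiction.net/s/9999999/" && echo up || echo down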
03:03
🔗
|
arrith |
bsmith093: btw when you said this earlier, were you talking about fanfiction.net? "<bsmith093> even with the useragent workaround, stopped at 300Kfiles, and i know there are at least 2million stories" |
03:04
🔗
|
bsmith093 |
oh yeah, but how do i gt that to make a list in a file of the storylinks, sorry for being this helpless, its just wget scares me with its many switches and arcane syntyax |
03:04
🔗
|
bsmith093 |
yes i was |
03:04
🔗
|
bsmith093 |
hold on ill lok in the bash history for the command |
03:05
🔗
|
arrith |
bsmith093: np, i like helping. just i hope at some future point you'll look back over the commands and try to understand them |
03:06
🔗
|
bsmith093 |
wget-warc -mcpkKe robots=off -U="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" www.fanfiction.net |
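That command, unpacked with the same flags commented:

    wget-warc -m -c -p -k -K \
      -e robots=off \
      -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
      www.fanfiction.net
    # -m  mirror: recursive, infinite depth, with timestamping
    # -c  continue partially-downloaded files
    # -p  also fetch page requisites (images, css)
    # -k  convert links for local browsing
    # -K  keep the pre-conversion copy as .orig
    # -e robots=off  ignore robots.txt
    # -U  spoof the Googlebot user-agent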
03:06
🔗
|
bsmith093 |
well, using them this much is helping :) |
03:07
🔗
|
arrith |
yeah. basically i know what i know due to lots of little jobs. one day i should sit down and read official documentations beyond the manpage but ehh.. not today |
03:07
🔗
|
bsmith093 |
me too |
03:07
🔗
|
arrith |
bsmith093: ah, so you were trying to dl ff.net using wget-warc |
03:07
🔗
|
arrith |
might be why ff.net wouldn't like you that much :P |
03:07
🔗
|
bsmith093 |
yeah, use a sledgehammer... |
03:08
🔗
|
chronomex |
to drive a screw! |
03:08
🔗
|
chronomex |
sure, after much cursing you'll get it in there |
03:08
🔗
|
chronomex |
but is that really what you need to do? |
03:08
🔗
|
bsmith093 |
chronomex: wher've you been all this time |
03:08
🔗
|
chronomex |
when? |
03:08
🔗
|
bsmith093 |
good to have you in the conversation :) |
03:08
🔗
|
* |
chronomex waves |
03:09
🔗
|
bsmith093 |
trying to dl fanfiction.net |
03:09
🔗
|
chronomex |
so I see |
03:09
🔗
|
bsmith093 |
any suggestions we're onto wget finally, but its tricky |
03:10
🔗
|
arrith |
dang, ff.net doesn't give a 404 for nonexistent stories |
03:10
🔗
|
bsmith093 |
could we parse the page for story not found |
03:10
🔗
|
arrith |
yeah, gotta do that |
03:10
🔗
|
bsmith093 |
or is it faster to grab anyway, then parse later |
03:10
🔗
|
arrith |
bsmith093: but that involves dling the page, which is more bandwidth ff.net has to suffer |
03:11
🔗
|
bsmith093 |
wget has a random wait option |
03:11
🔗
|
arrith |
oh. all depends on what you want to do. could do it both ways. i tend to like grabbing then parsing later, but i figure diskspace is cheap |
03:11
🔗
|
bsmith093 |
cant remember th e switch |
03:11
🔗
|
arrith |
--random-wait :) |
03:11
🔗
|
bsmith093 |
hey will -spider tell us how big it is |
03:12
🔗
|
arrith |
random wait is more about not blocking wget than saving the host bandwidth |
03:12
🔗
|
arrith |
ah interesting thought |
03:12
🔗
|
bsmith093 |
cause that would be good to know |
03:12
🔗
|
arrith |
doesn't seem to. just says "Length: unspecified [text/html]" |
03:12
🔗
|
bsmith093 |
also on the archiveteam, see where tsp got to, this was apparently his baby |
03:12
🔗
|
arrith |
i'm just looking at the output of this btw: wget --spider http://www.fanfiction.net/s/9999999/ |
03:13
🔗
|
bsmith093 |
sorry the archiveteam *wiki* |
03:13
🔗
|
bsmith093 |
and |
03:14
🔗
|
arrith |
np, i got that. only mention i find says "Tsp is attempting to archive the stories from fanfiction.net and fictionpress." on http://archiveteam.org/index.php?title=Projects |
03:14
🔗
|
bsmith093 |
plus that might not necessarily grab all the chapters, either. |
03:15
🔗
|
chronomex |
been a while since I've seen Teaspoon around |
03:16
🔗
|
arrith |
bsmith093: what won't get all the chapters? wget? i figured we're just using wget (or curl) to check if the story exists, then feeding a list of stories into download.py |
03:16
🔗
|
arrith |
dang, would be nice if that guy had some writeup somewhere on his progress |
03:18
🔗
|
bsmith093 |
oh yeah, the python script |
03:18
🔗
|
bsmith093 |
i know, right? |
03:19
🔗
|
arrith |
bsmith093: i gotta go for a bit for dinner. i'll bbl. next step i see is to get curl or wget to go over the linklist and sort them into a known good (and maybe known bad) list. i'll help with that if you need it when i get back |
03:20
🔗
|
bsmith093 |
ill look ove the wget man pages to see i fwe cant just sane eveeryhting as a uniwue name, and sort later |
03:20
🔗
|
bsmith093 |
Coderjoe: you still here |
03:20
🔗
|
bsmith093 |
??? |
03:21
🔗
|
bsmith093 |
chronomex: |
03:21
🔗
|
bsmith093 |
alard: \ |
03:21
🔗
|
chronomex |
what, hi |
03:21
🔗
|
chronomex |
I'm tending a makerbot. |
03:21
🔗
|
bsmith093 |
oohh , yay for you!] |
03:21
🔗
|
chronomex |
so I'm here, just not watching irc |
03:22
🔗
|
bsmith093 |
can curl parse html, for a certain string |
03:22
🔗
|
chronomex |
no |
03:22
🔗
|
bsmith093 |
( i know nothing about it) |
03:22
🔗
|
chronomex |
curl does not look at what it downloads for you |
03:22
🔗
|
bsmith093 |
cause i have a massive linklist for ff.net and most of them probably dont exist |
03:23
🔗
|
bsmith093 |
ffnet returns html with story not found if that id doesnt exist |
03:23
🔗
|
chronomex |
sounds like you'll have to write a custom-ishspider |
03:24
🔗
|
bsmith093 |
ugh |
03:25
🔗
|
bsmith093 |
any ideas on code thats already done part of something like this? |
03:26
🔗
|
bsmith093 |
alard: u seem to write most of the scripts archiveteam uses, any ideas o n saving ff.net |
03:38
🔗
|
underscor |
<alard> For some reason heritrix doesn't really listen to my parallelQueues = 15 setting. It's just running one queue |
03:39
🔗
|
underscor |
From what I remember of the presentation at IA, you can't spread the same domain into multiple queues |
03:43
🔗
|
chronomex |
huh. |
03:43
🔗
|
underscor |
Also, you gamepro guys |
03:43
🔗
|
underscor |
Make sure you're getting the articles, there are a lot of interstitials |
03:43
🔗
|
chronomex |
server-side interstitials? |
03:44
🔗
|
bsmith093 |
any suggestions for things that might work for ffnet |
03:44
🔗
|
underscor |
The two I got were meta redirects |
03:44
🔗
|
underscor |
so, yeah |
03:44
🔗
|
chronomex |
o ok |
03:44
🔗
|
underscor |
bsmith093: not off the top of my head |
03:44
🔗
|
underscor |
but it's definitely something I'm interested in |
03:48
🔗
|
bsmith093 |
hey do invalid links have identical md5sum |
03:48
🔗
|
bsmith093 |
doesnt really solve the bandwidth load issue, but it would help with weeding |
03:48
🔗
|
Coderjoe |
bsmith093: sorry, I was trying to get some work done |
03:49
🔗
|
bsmith093 |
np, we all have lives ;) |
03:49
🔗
|
bsmith093 |
whoops wrong emote |
03:55
🔗
|
arrith |
back |
03:55
🔗
|
bsmith095 |
hey |
03:56
🔗
|
arrith |
bsmith095: hey |
03:56
🔗
|
bsmith095 |
so i was looking and i cant find anything to parse a webpage, which is odd |
03:56
🔗
|
arrith |
ah |
03:56
🔗
|
arrith |
checking for a thing existing or not shouldn't be hard. just grab the page then grep it |
03:56
🔗
|
arrith |
well there's a bunch of python libraries to do it 'properly' but i just use grep and exit codes |
03:57
🔗
|
bsmith095 |
yeah but again the huge bandwidth issue |
03:57
🔗
|
bsmith095 |
and i have absolutely no idea how u guys do it, but id love to paralellize this problem, how much space could 20 billion words possibly take up? |
03:57
🔗
|
arrith |
bsmith095: oh, ehh. well at least with wget i wasn't really able to find a way to get it to report page size |
03:58
🔗
|
bsmith095 |
do they have same md5sum that ould help |
03:59
🔗
|
arrith |
yeah that's one way. but i'm pretty sure a grep would work fine. it wouldn't be futureproof but it'd get the job done at first |
03:59
🔗
|
arrith |
bsmith095: but wait, so did that one wget-warc download a decent amount of ff.net? since you said it got up to 300k or something? |
03:59
🔗
|
bsmith095 |
yeah but it was beat t hell with hmtl, and css and ads and things, plus i needed the space so i dumped it |
04:00
🔗
|
arrith |
ah |
04:00
🔗
|
arrith |
bsmith095: i was just wondering what kind of error you ended up getting since that's when ff.net might've blocked you |
04:00
🔗
|
arrith |
btw about 20 billion words, assuming 5 characters per word and a following space: 20 billion * ((5 bytes) + (1 byte)) = 111.758709 gigabytes |
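The same arithmetic in shell form, using arrith's assumed 6 bytes per word:

    echo $(( 20 * 10**9 * 6 ))           # 120000000000 bytes
    echo $(( 20 * 10**9 * 6 / 2**30 ))   # ~111 GiB, matching the estimate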
04:00
🔗
|
bsmith095 |
holy christ, thats a lot! even for now, wehere u can buy terabytes like bread |
04:01
🔗
|
arrith |
heh |
04:01
🔗
|
arrith |
well compression goes a heck of a long way though. bzip or gzip should do a small fraction of that |
04:01
🔗
|
bsmith095 |
btw thats the estimate of how big ff is |
04:01
🔗
|
arrith |
for text at least |
04:01
🔗
|
arrith |
ah |
04:01
🔗
|
arrith |
well i'd say you could get that down to maybe a few GB, probably less |
04:01
🔗
|
arrith |
with compression |
04:03
🔗
|
bsmith095 |
can i compress than immediately dump the uncompressed, without completely killing my slighty overworked hardrive? |
04:03
🔗
|
bsmith095 |
is there an app for that, cuz it would rock |
04:05
🔗
|
arrith |
bsmith095: that'd just be part of a script |
04:06
🔗
|
arrith |
like have the download.py grab the file, then compress it right after |
04:06
🔗
|
bsmith095 |
man, i have *got* to learn scripting :D |
04:06
🔗
|
arrith |
it'd compress individually which won't save as much space, but you can recompress them all in batch after it's done |
04:07
🔗
|
arrith |
well i can put together a small thing in bash that'll get this job done. then you can learn python and make it all fancy :) |
04:07
🔗
|
bsmith095 |
thatll work |
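A sketch of that small bash thing: gzip already does the compress-then-delete step on its own, writing page.html.gz and removing page.html. That downloader.py leaves .html files in the current directory is an assumption here:

    while read link; do
      python downloader.py -f html "$link"
      gzip -f -- *.html 2>/dev/null   # compresses each new page, deleting the original
    done < linklist.txt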
04:07 <arrith> one thing you gotta figure out is why download.py isn't working though
04:09 <bsmith095> hey does amazon ec2 have a trial i could completely kill one of their instances so my poor laptop doesn't have to
04:09 <Coderjoe> good god ff.net are pricks.
04:09 <Coderjoe> Disallow: /
04:09 <Coderjoe> User-agent: ia_archiver
04:09 <bsmith095> i figure i could dump the links into wget, have it name the files based on the id# then grep for story not found
04:10 <bsmith095> use wget
04:10 <arrith> one sec
04:10 <chronomex> Coderjoe: I'm not surprised, given the fanfic people I've known.
04:11 <bsmith095> would amazon ec2 let me use them for this?
04:11 <chronomex> there's so much shady shit on ec2
04:11 <arrith> bsmith095: might be. but don't run it on anything you need for business in case it does get shut down
04:11 <chronomex> just pay your bill and don't run a botnet.
04:11 <bsmith095> meaning what?
04:12 <Coderjoe> they'll let you do a lot... but you will probably wind up paying a bunch in bandwidth and instance
04:12 <bsmith095> ugh, bandwidth again, its 2011, nearly 2012, i thought we were past this!
04:12 <Coderjoe> (my bill for nov is $255.62)
04:12 <Coderjoe> not in the server market
04:12 <bsmith095> for what kind of usage
04:13 <Coderjoe> general rule is "sender pays"
04:14 <bsmith095> can this thing be parallelized easily
04:14 <arrith> bsmith095: yeah
04:14 <arrith> when we did the google video stuff we just put together chunks and people claimed the chunks
04:14 <arrith> one person does 1-20,000, another does 20,000-40,000, etc
04:15 <arrith> er 20,001
04:15 <bsmith095> well i have that 300mb file i could pass around
04:15 <arrith> yeah. well i think there's some kind of script already for delegating stuff that was mentioned earlier
04:15 <arrith> i was gonna look into that
04:16 <arrith> "<db48x> integrate it with the scripts we used for splinder/mobileme/anyhub"
04:16 <arrith> whatever those are
04:16 <Coderjoe> $97.99 in instance charges (one free micro, 245 hours of an m2.xlarge, 194 hours of an m1.large), $64.26 in s3 (stashed some grab stuff in there to get it off an instance. the storage was cheaper than the bandwidth out.), $93.37 in data transfer (873.951GB in for free, 15GB out for free, 778.049GB out for $0.120/gb)
04:17 <bsmith095> there's a repo for them
04:17 <Coderjoe> i would much rather do parallelization with a full clean script and a tracker that hands out chunks of a few stories
04:17 <bsmith095> whoo! that's cheap but not super cheap
04:17 <Coderjoe> and I was using spot instances for those two non-free instances
04:17 <underscor> <Coderjoe> User-agent: ia_archiver
04:18 <underscor> <Coderjoe> Disallow: /
04:18 <underscor> I wish they would just disobey it
04:18 <underscor> I mean
04:18 <underscor> Archive the site regardless
04:18 <Coderjoe> when is ff.net going down?
04:18 <underscor> but if the robots.txt blocks it, just don't make it public
04:18 <bsmith095> the story links are fanfiction.net/s/0000000 through 9999999
04:18 <arrith> Coderjoe: i don't think it is. i think this is just pre-emptive
04:18 <underscor> Coderjoe: Pre-emptive afaik
04:19 <bsmith095> Coderjoe: its not that i know of, im being proactive
04:19 <bsmith095> this is worse than geocities, mostly b/c the "creative, irreplaceable stuff" quotient is much higher
04:19 <bsmith095> quotient, can u tell im typing by the light of my monitor
04:20 <underscor> bsmith095: Why not iterate through every combination?
04:20 <arrith> btw was it determined if the geocities effort got all of geocities or were some sites lost?
04:20 <Coderjoe> underscor: each story has chapters which are on separate pages
04:20 <bsmith095> we could and i was going to, but that's 10 million links, most of which are non existent story wise, but which give back a page saying story not found
04:20 <Coderjoe> arrith: i think sites were lost
04:20 <arrith> underscor: we've done that. we have a tool that checks each fanfiction id for chapters
04:21 <bsmith095> and also that chapter thing
04:21 <bsmith095> arrith: we do
04:21 <bsmith095> ??
04:21 <arrith> bsmith095: oh yeah, sorry, i thought you knew kinda. the download.py takes just a normal link and grabs all available chapters
04:21 <underscor> bsmith095: No need
04:21 <arrith> one sec i'll pastebin
04:21 <underscor> Just send a HEAD request
04:21 <underscor> Only a few bytes
04:21 <bsmith095> a what now?
04:22 <underscor> curl -I http://www.fanfiction.net/s/7597723
04:22 <underscor> Just gets the headers
04:22 <underscor> Tells you whether a story exists or not
04:22 <underscor> Then you can go back later on
04:22 <arrith> bsmith095: http://pastebin.com/kKpNxEBy
04:22 <bsmith095> underscor: I HEART U
04:22 <arrith> underscor: oh yeah that's what we're trying to do now. i was gonna wget then grep for "story not found", but hmm
04:22 <underscor> but then you have to download the whole page
04:22 <bsmith095> that's exactly what i was looking for!!!
04:22 <underscor> -I is a lot better
04:22 <underscor> Now, the interesting thing
04:23 <arrith> underscor: is -I to check for a 404?
04:23 <underscor> is that it always returns 200 Ok
04:23 <underscor> Nope
04:23 <underscor> It just sends you the headers
04:23 <underscor> But a valid story will have a header like
04:23 <underscor> Cache-Control: public,max-age=1800
04:23 <underscor> Last-Modified: Fri, 02 Dec 2011 04:21:35 GMT
04:23 <underscor> Invalid ones will have
04:23 <underscor> Cache-Control: no-store
04:23 <underscor> Expires: -1
04:23 <bsmith095> hey thanks want the linklist
04:23 <arrith> ohh clever
04:24 <underscor> bsmith095: Isn't it just 0-999999
04:24 <underscor> ?
04:24 <bsmith095> see i knew the web would come up with something!
04:24 <arrith> 0000000 actually
04:24 <bsmith095> yes
04:24 <arrith> i think
04:24 <bsmith095> 7 digits
04:24 <arrith> dunno if 0 works too
04:24 <bsmith095> probably
04:24 <arrith> yeah
04:24 <underscor> ok
04:24 <arrith> 10^7-1
04:24 <bsmith095> no 7 dig
04:24 <bsmith095> 10 million
04:24 <underscor> seq -w 0 9999999
04:24 <underscor> bam
04:24 <arrith> we used this to gen a numberlist: echo {0000000..9999999} | tr ' ' '\n' > file.txt
04:25 <underscor> oh, that works too
04:25 <arrith> ah yeah, i wasn't sure about seq. just the #bash people always say to use {x..n} over seq, forget why
04:25 <underscor> because it's a builtin probably
04:25 <underscor> I prefer seq though, personally
04:26 <arrith> ah seq does newlines, nice
04:26 <arrith> just for fun i'm "time"ing them
04:26 <underscor> Yeah, that's one of the reasons
04:27 <bsmith095> 190s probably
04:27 <arrith> 0m8.544s for seq, my echo one is still running
04:27 <arrith> just finished 0m39.947s
04:27 <bsmith095> ure fast
04:28 <arrith> seq is so the way to go heh
04:28 <bsmith095> now we just have to get the damn downloader script to take id# as opposed to id# and title links
04:28 <arrith> bsmith093: well i think that's an ip issue, not the script necessarily
04:28 * underscor quietly works on his own version in bash
04:28 <arrith> since it works fine for me
04:28 <arrith> underscor: haha
04:28 <arrith> lemme pastebin the snippets i have so far
04:29 <underscor> I actually started this back in like March
04:29 <underscor> haha
04:29 <arrith> ohh
04:30 <arrith> good to hear
04:30 <bsmith095> pass me a valid link
04:30 <arrith> actually the stuff i have is just weird stuff using grep and a thing to generate a ~350MB file of linklists
04:30 <bsmith095> just to test
04:30 <arrith> http://www.fanfiction.net/s/7597723
04:31 <bsmith095> anyone have a valid story link
04:32 <arrith> or http://www.fanfiction.net/s/5192986/
04:32 <arrith> underscor: is your stuff in a single bash script? and are you using fanficdownloader-4.0.6 ?
04:32 <bsmith095> output http://pastebin.com/5Q09g7xB
04:32 <underscor> My stuff is a single bash script
04:32 <underscor> and no, I didn't know it existed
04:33 <chronomex> clearly not enterprisey enough
04:33 <bsmith095> pastebin it please?
04:33 <Coderjoe> underscor: for distributed efforts, I would prefer something like python over bash. bash relies on other processes on the system and as a result has too many variations
04:33 <chronomex> yes, bash scripts are pretty fragile
04:34 <underscor> Absolutely, I agree
04:34 <bsmith095> really? yours have been pretty robust
04:34 <underscor> I'm not very comfortable with python though, so I'm just dicking around in bash atm
04:34 <chronomex> it takes work to make them robust.
04:34 <Coderjoe> (farming out to a wget process is ok)
04:34 <bsmith095> ah
04:34 * chronomex currently scraping several million pages with ruby
04:35 <arrith> learning python is something. but eh, you can keep bash pretty portable
04:35 <bsmith095> chronomex: ruby?
04:35 <chronomex> bsmith095: yes, I've been using ruby lately
04:35 <underscor> Ruby has a lovely http library
04:35 <bsmith095> weeding the linklist
04:35 <arrith> underscor: getting the ff.net effort as part of the scripts for mobileme and stuff to distribute the effort i think would be good
04:35 <underscor> typhoeus or something
04:36 <underscor> arrith: I agree
04:36 <underscor> However, I am probably not your man
04:36 <bsmith095> typhoeus?? what
04:36 <underscor> mostly due to time
04:36 <arrith> underscor: is your bash stuff at a state you can show people?
04:36 <chronomex> underscor: I use Mechanize and Hpricot.
04:36 <underscor> arrith: Not atm
04:36 <underscor> I'll work on fixing it up
04:37 <underscor> chronomex: https://github.com/dbalatero/typhoeus
04:37 <Coderjoe> arrith: not really. dld-streamer.sh (and my chunky.sh from friendster) relied on associative arrays. CentOS has too old of a bash. freebsd and osx have BSD userland, while unix people typically have gnu userland. and there have been bugs between different versions of tools like expr.
04:37 <chronomex> underscor: hmmm, neat. I'm scraping single sites, though, so I don't have much use for 1000 threads :P
04:37 <arrith> underscor: hmm alright. i'm not sure what you've done so i dunno if the current methods bsmith095 and i are using are the best ones
04:37 <underscor> I tend to do things the "fuck the building is burning down, just get some shit written" way
04:37 <underscor> so anything y'all do is probably cleaner
04:38 <chronomex> ^
04:38 <underscor> My bash scripts are basically the exact opposite of alard's
04:38 <chronomex> I do things the underscor way, unless I'm releasing it into the wild.
04:38 <arrith> i'm doing "i barely know how to string this stuff together but it seems to work so w/e"
04:38 <underscor> Extremely non portable, and no idiot checks
04:38 <bsmith095> yeah, alard would know, somebody wake him/(her?) up
04:38 <underscor> him
04:38 <bsmith095> ah
04:39 <underscor> He's in NL (I think?) so it's early there (?)
04:39 <bsmith095> where are all the female geeks
04:39 <arrith> Coderjoe: ah expr bugs don't sound fun, and yeah you have to avoid a lot to make a portable script. just egh, bash just seems easier to me than python
04:39 <arrith> bsmith095: asleep
04:39 <bsmith095> NL, where's that
04:39 <chronomex> arrith: bash is easier to get into but harder to make work well.
04:39 <Coderjoe> python is not that hard, and you have a HUGE standard library to rely on
04:39 <bsmith095> terrible with geography
04:39 <chronomex> bsmith095: the netherlands? that's in europe, silly
04:40 <bsmith095> ah yeah the web is global, i forgot
04:40 <chronomex> ....
04:41 <bsmith095> i once got into an argument with my dad that u cant go east to get to russia, im like no wait that's the other side of ... oh wow duh :D
04:41 <bsmith095> been looking at maps too long need a glboe
04:41 <bsmith095> globe
04:41 <arrith> yeah dang, i gotta learn python. and go through all the grueling hours of relearning how to do some simple thing
04:41 <Coderjoe> er, the problem was in grep
04:41 <Coderjoe> https://github.com/ArchiveTeam/friendster-scrape/commit/b1f5b72cd13e20d6b02c20d8fc7b2710fc816a61
04:41 <bsmith095> arrith: sorry for the tedium
04:42 <arrith> with bash it's like you can copy and paste around a bunch, python feels like you can't just piece stuff together
04:42 <arrith> Coderjoe: oh dang, grep -o. i've run into so many "-o" bugs it's not even funny
04:42 <underscor> For example
04:42 <underscor> This is what I'm using to test IDs
04:42 <arrith> i just avoid it and do weird sed mangling
04:42 <underscor> It's dirty as hell
04:43 <underscor> var=`curl -s -I http://www.fanfiction.net/s/5983988|grep Last`;if [ -z $var ]; then echo "Not a story";else echo "Story";fi
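The same test with the two fragile spots patched: $var quoted so test does not fall over on an empty or multi-word result, and the grep anchored to the actual header name instead of any line containing "Last":

    var=$(curl -s -I http://www.fanfiction.net/s/5983988 | grep -i '^Last-Modified:')
    if [ -z "$var" ]; then echo "Not a story"; else echo "Story"; fi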
04:43 <arrith> underscor: i was thinking about asking you how you did that. just now i was diffing the output of various curl -Is
04:43 <bsmith095> now print that list to a file and we're golden
04:43 <underscor> I use grep -oP all the time
04:43 <underscor> it's rad
04:43 <bsmith095> wth is the z switch
04:43 <chronomex> bsmith095: null-terminated
04:43 <chronomex> oh, in [
04:43 <chronomex> bsmith095: 'man test'
04:44 <underscor> It's "if it is set"
04:44 <arrith> yeah if it exists
04:44 <Coderjoe> -z is string is empty
04:44 <arrith> underscor: i tend to go off of grep's exit code
04:45 <arrith> grep thing; if [ $? -eq 0 ]; then stuff; fi
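The tidier spelling of that pattern lets grep's exit status drive the if directly, with -q suppressing the output; "thing", "file", and "stuff" are arrith's own placeholders:

    if grep -q thing file; then
      stuff
    fi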
04:45 <arrith> exit codes seem 'faster' to me
04:45 <Coderjoe> underscor: what happens if the story has the word "Last" in it?
04:45 <underscor> Doesn't matter
04:46 <underscor> curl -I gets headers only
04:46 <Coderjoe> oh. you're checking Last-modified
04:46 <underscor> Yeah
04:46 <underscor> invalid stories don't have ti
04:46 <underscor> s/ti/it/
04:46 <Coderjoe> why can't people just use frigging HTML status codes. this is EXACTLY what 404 is for, and you can still have your own custom 404 page
04:47 <chronomex> s/HTML/HTTP/
04:47 <Coderjoe> yes, I meant http
04:48 <arrith> yeah for stories that don't exist they don't have Last-Modified, and they also have "Cache-Control: no-store" and "Expires: -1"
04:48 <chronomex> that's fucking retarded.
04:48 <arrith> Coderjoe: seriously. ff.net not using 404s is so annoying right now
04:49 <underscor> I wonder how ff.net feels about 10 million HEAD requests
04:49 <underscor> lol
04:50 <arrith> they should've used 404s :P
04:50 <arrith> better than grabbing the full page like i was gonna do..
04:50 <underscor> Well, they'd be HEADs regardless
04:50 <underscor> At least we don't have to grab the full page
04:50 <arrith> oh, right
04:51 <arrith> i guess i assumed whatever wget --spider does is as lightweight as it can get. i actually don't know what's in what it does
04:51 <arrith> some kind of HEAD
04:52 <arrith> what's the best thing like piratepad to use these days in terms of doesn't time out?
04:52 <chronomex> typewith.me ?
04:52 <arrith> oh hmm wait actually, there's one for code
04:52 <arrith> i forget its name. it's new
04:53 <Coderjoe> arrith: splinder wasn't using a status code to say "hey, we're temporarily down for maintenance". instead they redirected to /splinder_noconn.html which was a 200
04:53 <arrith> Coderjoe: ahh wow
04:54 <arrith> ahh i was thinking of stypi but it doesn't have bash/sh support
04:54 <arrith> ;/
04:54 <arrith> i wonder which of what they do support is the closest to bash
04:59 <underscor> \o/ Progress
04:59 <underscor> http://pastebin.com/ReqNs8TF
05:00 <arrith> underscor: looks good
05:02 <bsmith093> i hate wireless when im at the fringes, what'd i miss?
05:03 <arrith> bsmith093: one sec
05:06 <arrith> bsmith093: http://pastebin.com/1QN2tagB
05:07 <arrith> oh
05:07 <arrith> bsmith093: also http://badcheese.com/~steve/atlogs
05:08 <arrith> forgot this channel had that
05:13 <bsmith093> ok now pass stpryinator the valid ids and it'll grab them all
05:14 <arrith> bsmith093: stpryinator?
05:14 <bsmith093> storyinator the output of the last pastebin link
05:15 <bsmith093> do we have a weeded list yet?
05:16 <arrith> bsmith093: is storyinator something? searching the backlog doesn't show anything
05:16 <arrith> the download.py thing?
05:16 <bsmith093> check the logs the last pastebin
05:17 <arrith> ohh
05:17 <arrith> bsmith093: storyinator is i guess the name of what underscor is working on. it's not done yet
05:17 <bsmith093> http://pastebin.com/ReqNs8TF
05:19 <arrith> bsmith093: yeah that, it's not done yet
05:19 <arrith> but here: http://paste.pocoo.org/show/515656/
05:20 <arrith> that's not to be run just as a script, but each piece kinda ran individually
05:21 <arrith> bsmith093: should generate a list of good and bad IDs (lines 21-29) then you just feed the list of good IDs into the fanficdownloader (lines 32-34)
05:21 <arrith> assumes you already have a nums.txt
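The pocoo paste itself is not in the log, but from the description a guess at its shape, built on the HEAD trick from earlier (valid stories carry a Last-Modified header); this is a reconstruction, not the actual paste:

    while read num; do
      if curl -s -I "http://www.fanfiction.net/s/$num/" | grep -qi '^Last-Modified:'; then
        echo "$num" >> goodlist.txt
      else
        echo "$num" >> badlist.txt
      fi
    done < nums.txt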
05:23 <bsmith093> yay u rock
05:24 <bsmith093> see i knew this wouldn't get done unless i nagged the community to get to it, and save already
05:25 <bsmith093> 200tb won't back itself up, but this is nothing compared to mobileme, in terms of volume anyway
05:27 <bsmith093> this will probably take all night
05:32 <bsmith093> good news is i can just dump these good id# links back into the original fanfic downloader im personally using
05:37 <underscor> wheeeee
05:37 <underscor> Frontpage Gotten
05:37 <underscor> Let's get some metadata.
05:37 <underscor> Running storyinator on id 5983988
05:37 <underscor> Title is A Different Beginning for the Cherry Blossom
05:37 <underscor> Writen by Soulless Light, whose userid is 1807842
05:38 <underscor> Placed in anime>>Naruto
05:40 <bsmith093> can i get that script
05:41 <bsmith093> approx 7 hrs till the id sorter is done sorting
05:41 <underscor> bsmith093: Doesn't actually download anything yet
05:41 <bsmith093> try this, fanfictiondownloader.net
05:42 <bsmith093> throw the good story ids in there
05:42 <bsmith093> fanfictiondownloader.com
05:43 <bsmith093> nevermind it is .net
05:43 <bsmith093> www.fanfictiondownloader.net
05:43 <arrith> oh one thing
05:43 <bsmith093> what
05:43 <arrith> bsmith093, underscor: the download.py thing only gets the stories i think
05:43 <underscor> oh okay
05:43 <arrith> but on ff.net there's like author commentary, history of stuff getting posted
05:43 <bsmith093> ummm yeah that's the idea
05:43 <underscor> I'm getting reviews and a bunch of stuff
05:44 <arrith> yeah
05:44 <arrith> there's a lot more to the site than the stories
05:44 <arrith> so i'd want to include those in a proper archival process
05:44 <arrith> underscor: good to hear
05:44 <bsmith093> well ok then you get that i'll get the stories
05:44 <arrith> bsmith093: heh well a backup of just the stories is still good to have
05:44 <arrith> then in a scramble there's a lot less to dl
05:44 <bsmith093> how will we re run this to update the archive
05:44 <bsmith093> just futureproofing here
05:45 <arrith> yeah that's something, periodic rerunning
05:45 <bsmith093> merge the deltas
05:45 <arrith> i didn't really put stuff into that script for that above hmm
05:46 <bsmith093> see this is why we only bother to archive closing sites, they dont change as much
05:46 <underscor> well, just figure out what the current latest ID is
05:47 <bsmith093> umm how exactly
05:47 <underscor> 7601310 is the current latest
05:47 <bsmith093> ure head thing done yet?
05:47 <underscor> http://m.fanfiction.net/j/
05:47 <bsmith093> as of when
05:47 <underscor> 10 seconds ago
05:48 <underscor> It's whatever's on the top of that page
05:48 <bsmith093> see this is what im talking about, we'll always be behind this site
05:48 <arrith> underscor: and every number up to that is known used?
05:48 <bsmith093> lots of skipped
05:48 <arrith> hm
05:49 <underscor> No
05:49 <underscor> But it's easy to have something check that page every 5 minutes
05:49 <Coderjoe> well, if you use wget-warc, with a cdx file, you can have it save the updated page
05:49 <bsmith093> arrith: run ure own script you'll see there are a lot of holes in the seq
05:50 <arrith> bsmith093: ah
05:50 <arrith> you'd need to check that page of new stuff pretty rapidly to make sure you don't miss anything
05:50 <Coderjoe> not really
05:50 <underscor> Yeah
05:51 <underscor> once every 10 minutes is probably sufficient
05:51 <underscor> or less often
05:51 <arrith> since the only other way i can think of to get new chapters is to go through story IDs to check for any new ones, then check all working story IDs for new chapters
05:51 <bsmith093> again, not to beat a dead horse, but this is a huge, popular, currently active website
05:51 <underscor> yeah
05:51 <Coderjoe> you just have one worker that checks it periodically, checks between the last-known-max and the latest on that page, and notifies the tracker
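One possible shape for that worker, polling the /j/ page underscor linked; the grep pattern assumes new stories show up there as /s/NNNNNNN links, which is unverified, and the tracker hand-off is left as an append to a file:

    last=$(cat last_max.txt)
    new=$(curl -s http://m.fanfiction.net/j/ | grep -o '/s/[0-9]*' | grep -oE '[0-9]+' | sort -n | tail -1)
    if [ "$new" -gt "$last" ]; then
      seq $((last + 1)) "$new" >> new_ids.txt   # report these to the tracker
      echo "$new" > last_max.txt
    fi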
05:51
🔗
|
arrith |
oh |
05:51
🔗
|
bsmith093 |
and thats another thing can somebody please code up a tracker? |
05:52
🔗
|
arrith |
right if it's always sequential. for a second i thought it was random, nvm |
05:52
🔗
|
bsmith093 |
so isit seq |
05:52
🔗
|
Coderjoe |
the holes are from deleted stories |
05:52
🔗
|
arrith |
does that updating page include reviews/author comments/user comments/etc though? looks to me like it's just new stories |
05:53
🔗
|
bsmith093 |
thats a lot of deletions any hope of recovery? |
05:53
🔗
|
bsmith093 |
ia waybac maybe? |
05:53
🔗
|
underscor |
ia doesn't archive it |
05:53
🔗
|
underscor |
arrith: nope |
05:53
🔗
|
Coderjoe |
they block IA, remember |
05:54
🔗
|
bsmith093 |
WTH not?! |
05:54
🔗
|
Coderjoe |
they block IA, remember |
05:54
🔗
|
bsmith093 |
so use googlebot, well its too late now, but still! |
05:54
🔗
|
arrith |
all the more reason to have an ongoing mirror |
05:54
🔗
|
Coderjoe |
http://www.fanfiction.net/robots.txt |
05:54
🔗
|
arrith |
which afaik means periodic respidering |
05:54
🔗
|
arrith |
for new comments, etc |
05:55
🔗
|
underscor |
yep |
05:55
🔗
|
bsmith093 |
and periodic dumps, like for wikipedia, but actually GOOD |
05:55
🔗
|
arrith |
yeah i recently saw someone talking about looking for directions on how to setup a wikipedia dump and was having a bit of trouble. i dunno how easy it is but it didn't sound fun to me |
05:56
🔗
|
Coderjoe |
http://b.fanfiction.net/atom/j/0/2/0/ |
05:56
🔗
|
Coderjoe |
"updated stories" in a nice rss feed |
05:56
🔗
|
bsmith093 |
see what a kick in the pants can do for productivity, none of this,(afaik) was happening 6 hrs ago |
05:56
🔗
|
arrith |
Coderjoe: ah nice and structured. gj |
05:57
🔗
|
bsmith093 |
ffnet is "fully automate" |
05:57
🔗
|
arrith |
bsmith093: ehh underscor was doing some stuff i think technically |
05:57
🔗
|
bsmith093 |
thats why i said afaik |
05:57
🔗
|
Coderjoe |
and here's new stories: http://b.fanfiction.net/atom/j/0/0/0/ |
05:58
🔗
|
bsmith093 |
yay, an atom feed we can scrape that! |
05:58
🔗
|
arrith |
and wherever that one guy left off. if he left a record |
05:58
🔗
|
bsmith093 |
he didnt |
05:58
🔗
|
arrith |
underscor: did you track how far Teaspoon / tsp got on ff.net? |
05:59
🔗
|
underscor |
Nope, sorry |
05:59
🔗
|
bsmith093 |
SketchCow: if ura still up, any thought/input/ constructive critisisms |
05:59
🔗
|
Coderjoe |
reviews are under /r/ instead of /s/ |
05:59
🔗
|
Coderjoe |
http://www.fanfiction.net/r/7573167/ |
05:59
🔗
|
bsmith093 |
thats useful go, automation! |
06:00
🔗
|
bsmith093 |
does the r match the s for the same story |
06:00
🔗
|
arrith |
underscor: np. eh well, probably not that far |
06:00
🔗
|
bsmith093 |
same # |
06:00
🔗
|
underscor |
yes |
06:00
🔗
|
Coderjoe |
yes |
06:00
🔗
|
bsmith093 |
whhoohoo ! so easy then |
06:01
🔗
|
Coderjoe |
there are also communities and forums that need archiving |
06:01
🔗
|
bsmith093 |
are those braindead simple url too |
06:01
🔗
|
Coderjoe |
communities are not |
06:02
🔗
|
Coderjoe |
nor are forums |
06:02
🔗
|
bsmith093 |
oy well cant have everything |
06:02
🔗
|
bsmith093 |
oh wait, I can, GO ARCHIVETEAM! |
06:02
🔗
|
arrith |
Coderjoe: you wouldn't happen to have found an atom/rss feed for reviews have you? |
06:03
🔗
|
arrith |
since the more structured the less darkarts html parsing |
06:04
🔗
|
Coderjoe |
they might be on that update feed |
06:04
🔗
|
Coderjoe |
(the first one, /atom/j/0/2/0) |
06:04
🔗
|
bsmith093 |
best part about this script is its much lighter pon my cpu and disk io |
06:04
🔗
|
Coderjoe |
it will list the story, and I think a new review might cause it to go on that |
06:05
🔗
|
Coderjoe |
I know it does chapters |
06:05
🔗
|
bsmith093 |
just passed 0006000 |
06:06
🔗
|
Coderjoe |
nope. this story has one review posted a couple weeks ago |
06:06
🔗
|
Coderjoe |
but is listed on the update feed with an update date of today. I think they posted a new chapter |
06:06
🔗
|
bsmith093 |
see if people wouldn't write so much we'd have less work :) |
06:09
🔗
|
bsmith093 |
its 10927est so off to bed for me, not leaving thought will be asleep |
06:11
🔗
|
bsmith093 |
quick thought here it would be great if once we have everything eatch out for ompleted sotries and pull them from the scrape queue |
06:11
🔗
|
bsmith093 |
completed stories and pull them out |
06:12
🔗
|
bsmith093 |
gnight |
06:13
🔗
|
Coderjoe |
still want to scrape for reviews. and there is nothing that says the author can't revise something |
06:14
🔗
|
arrith |
yeahh i was thinking author edits |
06:14
🔗
|
arrith |
and author comments |
06:14
🔗
|
arrith |
i don't have experience with this kind of stuff but i'm hoping the header last modified is accurate in this case |
06:15
🔗
|
arrith |
bsmith093: you're checking for existing stories? |
06:15
🔗
|
arrith |
since if fanfiction.net is blocking him ( bsmith093 ) then he'd just get a big list of false negatives ;/ |
06:16
🔗
|
underscor |
Do new comments change the "update date"? |
06:16
🔗
|
Coderjoe |
no |
06:16
🔗
|
underscor |
ok |
06:16
🔗
|
Coderjoe |
at least I don't think so |
06:17
🔗
|
Coderjoe |
that's something I haven't specifically checked. I did find that the stories I looked at had a newer update date than the last review |
06:20
🔗
|
arrith |
underscor: for ease of notation, you can have the 999 stuff like this: seq -w 0 $((10**7 - 1)) |
06:20
🔗
|
arrith |
or $[10**7 - 1] |
06:20
🔗
|
Coderjoe |
bleh |
06:20
🔗
|
arrith |
hard for me at least to visually see how many 9s there are |
06:20
🔗
|
Coderjoe |
seq was another compatability issue |
06:20
🔗
|
godane |
i have 87 episodes of crankygeeks |
06:21
🔗
|
arrith |
Coderjoe: ohh yeah. trying to shoehorn seq stuff into jot on osx |
06:21
🔗
|
godane |
i'm just getting ipod format since its only 105mb after episode 70 |
06:22
🔗
|
godane |
mostly so i can fit 40 episodes onto a dvd |
06:23
🔗
|
NotGLaDOS |
At that size, don't you mean approx 47? |
06:23
🔗
|
arrith |
4.3 for gibibytes |
06:23
🔗
|
arrith |
about |
06:23
🔗
|
godane |
more like 43 |
06:23
🔗
|
NotGLaDOS |
Ah. |
06:23
🔗
|
NotGLaDOS |
I have a 4.7 GB DVD-R here. |
06:23
🔗
|
godane |
some are 115mb |
06:23
🔗
|
arrith |
4.7 uses the base 10 'gigabyte' harddrive mfw scammers use |
06:24
🔗
|
NotGLaDOS |
Also, why is this not changing nick to NotGL- |
06:24
🔗
|
arrith |
4.7 gigabyte to gibibyte= 4.3772161006927490234375 gibibytes |
06:24
🔗
|
NotGLaDOS |
Ah |
06:25
🔗
|
arrith |
underscor: you gotta have a github repo called "DON'T LOOK HERE" then just secretly push stuff like your ff.net work ;P
06:26
🔗
|
NotGLaDOS |
<*status> | STR_IDENT | 1 | Yes | irc.underworld.no | NotGLaDOS!STR_IDENT@ip188-241-117-24.cluj.ro.asciicharismatic.org | 3 | |
06:26
🔗
|
NotGLaDOS |
I don't get it. |
06:27
🔗
|
NotGLaDOS |
...wait, is my nick NotGLaDOS? |
06:27
🔗
|
arrith |
it is |
06:27
🔗
|
Coderjoe |
yes |
06:27
🔗
|
NotGLaDOS |
...damn quassel playing tricks on me |
06:28
🔗
|
NotGLaDOS |
And fixed, with hackery. |
06:28
🔗
|
Coderjoe |
<NotGLaDOS> And fixed, with hackery. |
06:28
🔗
|
Coderjoe |
try /nick YourNewNick |
06:28
🔗
|
NotGLaDOS |
I had to force a module to send a false IRC command on the in direction, because it was displaying my nick as STR_IDENT |
06:29
🔗
|
underscor |
Wheee |
06:29
🔗
|
underscor |
Even farther! |
06:29
🔗
|
underscor |
http://pastebin.com/MWsp8Fv3 |
06:29
🔗
|
arrith |
underscor: progress? |
06:29
🔗
|
underscor |
arrith: Yep :) |
06:29
🔗
|
arrith |
ahh pretty nice |
06:30
🔗
|
arrith |
underscor: at this point are you echoing extracted data? |
06:30
🔗
|
arrith |
or is there stuff it's doing on the bg that isn't echoed? |
06:30
🔗
|
Coderjoe |
story ID number in xml? |
06:31
🔗
|
Coderjoe |
(yes, you have the directory, but the number in the xml can help make sure that it can be correlated if separated and/or renamed) |
06:33
🔗
|
underscor |
arrith: Everything its doing is echoed |
06:33
🔗
|
underscor |
Coderjoe: Good idea, added |
06:33
🔗
|
underscor |
it's* |
06:36
🔗
|
bsmith093 |
is there a repo yet?
06:36
🔗
|
arrith |
underscor: what do you use to deal with xml in bash? |
06:36
🔗
|
arrith |
bsmith093: kind of underscor's pet project at this point i think |
06:36
🔗
|
underscor |
yeah |
06:36
🔗
|
underscor |
xml handling is done in php |
06:36
🔗
|
arrith |
bsmith093: also, you might want to doublecheck that you're getting some good nums and not all bad |
06:37
🔗
|
arrith |
ohh |
06:37
🔗
|
arrith |
underscor: cheater! :P |
06:37
🔗
|
underscor |
function assocArrayToXML($root_element_name,$ar)
{
    $xml = new SimpleXMLElement("<?xml version=\"1.0\"?><{$root_element_name}></{$root_element_name}>");
    $f = create_function('$f,$c,$a','
        foreach($a as $k=>$v) {
            if(is_numeric($k))
                $k="v".$k;
            if(is_array($v)) {
                $ch=$c->addChild($k);
                $f($f,$ch,$v);
            } else {
                $c->addChild($k,$v);
            }
        }');
    $f($f,$xml,$ar);
    return $xml->asXML();
}
06:37
🔗
|
bsmith093 |
yeah i think im getting false negatives
06:37
🔗
|
arrith |
bsmith093: are you getting any positives? |
06:38
🔗
|
bsmith093 |
random check |
06:38
🔗
|
arrith |
arrith: as in any in goodlist.txt? |
06:38
🔗
|
bsmith093 |
0005543 |
06:38
🔗
|
arrith |
underscor: ah, not too complicated |
06:39
🔗
|
arrith |
underscor: could probably rewrite that in bash.. |
06:39
🔗
|
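A minimal sketch of what that bash rewrite might look like (bash 4 associative arrays; flat keys only, since bash has no nested arrays to feed the recursion the PHP version handles, and values are assumed to need no XML escaping):

    declare -A meta=([id]=0005543 [title]="Example Story")   # hypothetical fields
    xml="<story>"
    for k in "${!meta[@]}"; do xml+="<$k>${meta[$k]}</$k>"; done
    xml+="</story>"
    echo "$xml"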
bsmith093 |
in goodlist but not there in firefox |
06:39
🔗
|
SketchCow |
WHAT THE HELLO HI |
06:39
🔗
|
bsmith093 |
check please |
06:39
🔗
|
SketchCow |
OK, my internet is back. |
06:39
🔗
|
bsmith093 |
hey hey hey its SkeeetchCow |
06:39
🔗
|
SketchCow |
Jesus, lot of backlog. |
06:39
🔗
|
NotGLaDOS |
I know. |
06:39
🔗
|
NotGLaDOS |
I looked at the channel and thought "Fuck it."
06:40
🔗
|
bsmith093 |
fanfiction.net/s/0005543 |
06:40
🔗
|
SketchCow |
I can't do that. |
06:40
🔗
|
SketchCow |
So bsmith093 is trying to save fanfiction? |
06:40
🔗
|
arrith |
bsmith093: yep dang. that is a false positive |
06:40
🔗
|
bsmith093 |
well me and underscor |
06:40
🔗
|
bsmith093 |
so its not just me |
06:40
🔗
|
arrith |
SketchCow: bsmith093 really wants to, underscor is doing a lot of stuff, i'm poking around with parts of it and Coderjoe is helping |
06:40
🔗
|
SketchCow |
So my question is, what's going on? |
06:40
🔗
|
SketchCow |
It's shutting down? |
06:40
🔗
|
underscor |
No |
06:40
🔗
|
arrith |
SketchCow: no, all pre-emptive |
06:40
🔗
|
bsmith093 |
ffnet seq id check |
06:41
🔗
|
underscor |
Preemptive
06:41
🔗
|
SketchCow |
OK. |
06:41
🔗
|
SketchCow |
So remember, if it's pre-emptive, don't rape it. |
06:41
🔗
|
bsmith093 |
if it was i'd make sure google knew
06:41
🔗
|
underscor |
SketchCow: ofc |
06:41
🔗
|
SketchCow |
That's all I can really contribute. |
06:41
🔗
|
SketchCow |
Looks javascript free. |
06:41
🔗
|
underscor |
I like raping sites though :( |
06:41
🔗
|
arrith |
SketchCow: they have a pretty light blocking trigger finger i think. at least they blocked bsmith093 somehow for some reason i think |
06:41
🔗
|
SketchCow |
Have we considered just pinging them? |
06:41
🔗
|
SketchCow |
Going HEY WE WANT A COPY |
06:41
🔗
|
SketchCow |
Or no |
06:41
🔗
|
bsmith093 |
and it has atom feeds for everything! whoa!
06:41
🔗
|
arrith |
not sure if anyone thought of trying that |
06:41
🔗
|
underscor |
They block IA |
06:42
🔗
|
underscor |
so they're a hostile target |
06:42
🔗
|
underscor |
<sunglasses> |
06:42
🔗
|
bsmith093 |
i keep saying use googlebot |
06:42
🔗
|
arrith |
yeah, i think based on blocking IA people figured to not try |
06:42
🔗
|
underscor |
bsmith093: No, I mean |
06:42
🔗
|
underscor |
They block the wayback machine |
06:42
🔗
|
underscor |
It doesn't actually stop us |
06:42
🔗
|
bsmith093 |
so switch the useragent |
06:42
🔗
|
underscor |
??? |
06:42
🔗
|
arrith |
one guy, Teaspoon / tsp i guess worked on this a little while ago but he hasn't been seen and people haven't seen how far he got |
06:43
🔗
|
underscor |
bsmith093: What do you mean? Wayback Machine isn't going to switch its useragent... |
06:43
🔗
|
underscor |
They obey robots.txt for legal reasons |
06:43
🔗
|
arrith |
bsmith093: they haven't blocked underscor's efforts afaik |
06:43
🔗
|
underscor |
I'm up to 71k |
06:43
🔗
|
arrith |
bsmith093: so he doesn't need to change his UA, at least not yet |
06:43
🔗
|
bsmith093 |
underscor: ohhh, ok then that makes much more sense |
06:43
🔗
|
underscor |
(just checking IDs |
06:43
🔗
|
underscor |
) |
06:44
🔗
|
bsmith093 |
i suppose that's so they can say well we didnt save u cause u blocked us
06:44
🔗
|
arrith |
underscor: are you still doing that check for "Last" in the header? |
06:44
🔗
|
underscor |
Yeah |
06:44
🔗
|
arrith |
underscor: since i think bsmith093 just found a false positive, he checked |
06:44
🔗
|
arrith |
here |
06:44
🔗
|
arrith |
underscor: http://www.fanfiction.net/s/0005543/ |
06:45
🔗
|
bsmith093 |
still dead for me |
06:45
🔗
|
arrith |
oh wait |
06:45
🔗
|
arrith |
i might've mixed up dead and not dead |
06:45
🔗
|
underscor |
Not a story |
06:45
🔗
|
underscor |
var=`curl -s -I http://www.fanfiction.net/s/0005543/|grep Last`;if [ -z "$var" ]; then echo "Not a story";else echo "Story";fi
06:45
🔗
|
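A possibly sturdier variant of that check (a sketch, not underscor's actual script): key off the HTTP status code instead of grepping for a header line, on the assumption that ff.net answers missing IDs with a non-200:

    check_id() {
      local code
      code=$(curl -s -o /dev/null -w '%{http_code}' "http://www.fanfiction.net/s/$1/")
      [ "$code" = "200" ] && echo "Story" || echo "Not a story"
    }
    check_id 0005543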
arrith |
bsmith093: goodlist is badlist and badlist is goodlist |
06:45
🔗
|
underscor |
Doesn't trip up mine |
06:45
🔗
|
bsmith093 |
ure kidding me!? |
06:45
🔗
|
underscor |
lol |
06:45
🔗
|
arrith |
bsmith093: just rename them when it's done :P |
06:45
🔗
|
bsmith093 |
checking bad list then
06:45
🔗
|
arrith |
i blame the error codes |
06:46
🔗
|
bsmith093 |
0009863 |
06:47
🔗
|
bsmith093 |
its good u did reverse the polarity |
06:47
🔗
|
bsmith093 |
yay tom baker |
06:47
🔗
|
arrith |
bsmith093: yeah looks like a story |
06:47
🔗
|
arrith |
heh, yeahh |
06:47
🔗
|
arrith |
i did say i didn't check the script :P |
06:47
🔗
|
bsmith093 |
stop and switch, or keep going?
06:47
🔗
|
underscor |
77000 now |
06:48
🔗
|
bsmith093 |
underscor: what are those numbers u keep giving |
06:48
🔗
|
underscor |
id's I've checked up to |
06:48
🔗
|
SketchCow |
OK, this needs to go to another channel. |
06:48
🔗
|
underscor |
for story/not story |
06:48
🔗
|
underscor |
:( |
06:48
🔗
|
SketchCow |
#fanboys or #fanfriction |
06:48
🔗
|
bsmith093 |
HOW R U THAT FAST? |
06:48
🔗
|
underscor |
second one |
06:48
🔗
|
bsmith093 |
K THEN |
06:48
🔗
|
bsmith093 |
caps |
06:49
🔗
|
arrith |
bsmith093: keep going |
06:49
🔗
|
bsmith093 |
k |
06:49
🔗
|
arrith |
SketchCow: you are good with those names |
06:49
🔗
|
bsmith093 |
remind me in 7hrs when its done
07:05
🔗
|
arrith |
reading the logs in the topic brought up a question that i didn't see answered: does archive.org archive porn? |
07:06
🔗
|
arrith |
or IA rather |
07:08
🔗
|
underscor |
^ |
07:09
🔗
|
underscor |
SketchCow |
07:09
🔗
|
arrith |
SketchCow: does the Internet Archive archive porn? |
07:09
🔗
|
underscor |
lol |
07:23
🔗
|
DFJustin |
http://www.archive.org/details/70sNunsploitationClipsNunsBehavingBadlyInBizarreFetishFilms |
07:23
🔗
|
Coderjoe |
ugh |
07:24
🔗
|
Coderjoe |
you had to link to the one I had seen before |
08:04
🔗
|
SketchCow |
Why do people ask that |
08:04
🔗
|
SketchCow |
Since Porn simply means "Material considered sexually or morally questionable by random community standards", of course it does. |
08:05
🔗
|
SketchCow |
So does google and so does facebook |
08:05
🔗
|
arrith |
good |
08:06
🔗
|
arrith |
SketchCow: what irc bouncer do you use? znc or irssi maybe? |
08:13
🔗
|
SketchCow |
Irssi |
08:14
🔗
|
SketchCow |
Pumped through a screen session |
08:15
🔗
|
arrith |
yeahh. i gotta learn irssi. |
08:27
🔗
|
SketchCow |
It's not so bad. |
08:27
🔗
|
SketchCow |
I use a screen session that puts the channel list along the right, like mIRC used to. |
08:27
🔗
|
SketchCow |
Also, I put this on the machine that runs textfiles.com and a bunch of services. |
08:27
🔗
|
SketchCow |
So I know INSTANTLY if something's wrong with the machine. |
08:30
🔗
|
arrith |
well you can't really right click on things and do other gui-ish stuff that you can in xchat. i can totally see me using irssi for logging but for everyday stuff i'm not sure yet |
08:35
🔗
|
dnova |
irssi supremacy |
08:41
🔗
|
SketchCow |
You can if you have an ssh client that makes URLs alive. |
08:41
🔗
|
SketchCow |
And people? Fuck people |
08:41
🔗
|
SketchCow |
They're all the same |
08:42
🔗
|
SketchCow |
who needs to right click on them |
08:45
🔗
|
dnova |
SketchCow: are you shooting all video on dslrs? |
08:46
🔗
|
SketchCow |
Yes. |
08:46
🔗
|
dnova |
that became a thing pretty quickly |
08:51
🔗
|
chronomex |
apparently it doesn't suck much at all. |
08:51
🔗
|
chronomex |
sensor is sensor, and dslrs often have nice sensor. |
08:52
🔗
|
dnova |
and excellent glass |
08:52
🔗
|
dnova |
and more variety |
08:52
🔗
|
dnova |
I don't think anything about it sucks |
09:08
🔗
|
SketchCow |
Some things suck. |
09:08
🔗
|
SketchCow |
But they're quite doable for what they are. |
09:09
🔗
|
SketchCow |
I have to also point out that I was trained, at 20, to be able to unload, canister, and then reload and thread 16mm film into a set of reels, all while inside a leather bag so they wouldn't be exposed to light. |
09:09
🔗
|
SketchCow |
Comparatively, this new material is even better than what that was giving me.
09:10
🔗
|
dnova |
what are the things that suck? |
09:10
🔗
|
dnova |
I know very little about video production |
09:12
🔗
|
SketchCow |
http://www.youtube.com/watch?v=mEdBId3OuuY |
09:16
🔗
|
dnova |
no gui at the moment |
09:42
🔗
|
ersi |
arrith: I'm using irssi in a screen session, and I'm saying you're wrong. I'm clickin' links like a darned Mechanical Turk on speed |
09:43
🔗
|
ersi |
Most terminals convert text that they think is a link into a clickable element. PuTTY (Win/*nix), gnome-terminal, rxvt, xterm, iTerm and co all do
10:16
🔗
|
arrith |
were passwords ever reset on the wiki? |
10:16
🔗
|
arrith |
i haven't logged in for a while and i'm having trouble. my password autocompletes but the login doesn't work |
10:22
🔗
|
SketchCow |
Maybe |
10:22
🔗
|
SketchCow |
I cleared out one-offs who did nothing
10:24
🔗
|
arrith |
SketchCow: could you look into User:Arrith really quick? |
10:24
🔗
|
dnova |
just reset your password |
10:24
🔗
|
arrith |
dnova: i was looking for a page for that |
10:24
🔗
|
arrith |
didn't find one |
10:25
🔗
|
ersi |
should be linked on the login page |
10:26
🔗
|
arrith |
ersi: you sure it's there? might just be me but i'm not seeing anything. just "Create an account." |
10:29
🔗
|
ersi |
hmm |
10:30
🔗
|
ersi |
huh, weird. alright.. there's no special page for that |
10:30
🔗
|
dnova |
really? |
10:30
🔗
|
arrith |
well the signup doesn't have an email, usually resetting a password involves sending an email out |
10:32
🔗
|
dnova |
welp. |
11:00
🔗
|
kin37ik |
how did the crawls from yesterday turn out? |
11:11
🔗
|
arrith |
well until further notice i am now arrith1 on the wiki |
11:17
🔗
|
dnova |
what is python used for in the splinder download process? |
12:15
🔗
|
emijrp |
testing script to download all wikkii.com wikis |
12:15
🔗
|
emijrp |
WIKIFARMS ARE NOT TRUSTWORTHY. |
12:17
🔗
|
chronomex |
k |
12:17
🔗
|
arrith |
indeed |
12:19
🔗
|
arrith |
btw if any Administrators get a chance, could one of them merge User:Arrith with User:Arrith1 please? |
12:19
🔗
|
arrith |
oh wait, seems only SketchCow can do that |
12:20
🔗
|
emijrp |
i guess no |
12:20
🔗
|
emijrp |
what nick do you want? |
12:20
🔗
|
emijrp |
i mean, sysop can merge pages |
12:21
🔗
|
emijrp |
user accounts i think it is impossible |
12:22
🔗
|
dnova |
if someone has a spare moment, could they put "Uploaded; still downloading more" next to my name in the splinder status table on the wiki? |
12:22
🔗
|
arrith |
emijrp: http://www.mediawiki.org/wiki/Extension:User_Merge_and_Delete says "merge (refer contributions, texts, watchlists) of a first account A to a second account B" |
12:22
🔗
|
emijrp |
mm, looks like merging accounts is possible
12:22
🔗
|
arrith |
which would be good enough for me |
12:22
🔗
|
emijrp |
but are extensions, which need to be installed separately |
12:23
🔗
|
emijrp |
not sure if jason is going to install it only for you |
12:23
🔗
|
emijrp |
: ))) |
12:23
🔗
|
arrith |
emijrp: http://archiveteam.org/index.php?title=Special:Version says it's already installed :P |
12:24
🔗
|
arrith |
although i am still curious why i can't get in on my original acct |
12:24
🔗
|
emijrp |
ah man, he used it to merge spam users, i forgot |
12:24
🔗
|
arrith |
dnova: done |
12:24
🔗
|
dnova |
thanks |
12:25
🔗
|
arrith |
yeah on that topic, there are a lot of pages with {{delete}} |
12:25
🔗
|
arrith |
and by a lot i mean more than 5 heh |
12:27
🔗
|
dnova |
we still need splinder downloaders |
12:28
🔗
|
emijrp |
why my userpage on AT wiki has 8000 pageviews? |
12:28
🔗
|
arrith |
emijrp: it's a very nice page |
12:28
🔗
|
arrith |
btw, the time stamp on the latest archiveteam.org wiki dump at http://www.archiveteam.org/dumps/ is 15-Mar-2009 :| |
12:28
🔗
|
arrith |
hardly "weekly" |
12:31
🔗
|
emijrp |
http://code.google.com/p/wikiteam/downloads/list?can=2&q=archiveteam |
12:31
🔗
|
emijrp |
i do weekly, but i dont upload them |
12:31
🔗
|
arrith |
aha, didn't think of wikiteam but now that i think about it that makes sense |
12:31
🔗
|
arrith |
emijrp: images though? |
12:31
🔗
|
emijrp |
i dont upload images to googlecode |
12:32
🔗
|
emijrp |
only 4gb of hosting |
12:32
🔗
|
emijrp |
http://www.archive.org/search.php?query=wikiteam%20archiveteam |
12:32
🔗
|
arrith |
hm. i'm not sure where a good host would be. my first thought is archive.org |
12:33
🔗
|
arrith |
hmm that's good but, are those maintained? seems to be from August and July |
12:33
🔗
|
emijrp |
http://www.referata.com/ is another unstable wikifarm
12:33
🔗
|
dnova |
what is a wikifarm |
12:34
🔗
|
emijrp |
free hosting for wikis |
12:34
🔗
|
dnova |
oh. |
12:36
🔗
|
emijrp |
referata is for semantic wikis |
12:36
🔗
|
emijrp |
cool stuff |
14:06
🔗
|
dnova |
downloading it:pornoromantica |
14:16
🔗
|
emijrp |
wtf is that |
14:19
🔗
|
dnova |
I sure do not know |
14:45
🔗
|
rude___ |
SketchCow have you seen this filter for eliminating aliasing in 5Dmk2 video? http://www.mosaicengineering.com/products/vaf-5d2.html |
15:02
🔗
|
underscor |
rude___: That's rad! |
15:06
🔗
|
rude___ |
yup, funny that I've never noticed aliasing that bad in 5Dmk2 video before seeing the demos |
15:06
🔗
|
underscor |
http://www.m0ar.org/6346 |
15:06
🔗
|
underscor |
This is amazing |
15:06
🔗
|
underscor |
(it's not actually porn) |
15:11
🔗
|
dnova |
pornoromantica is up to 29mb. |
15:16
🔗
|
emijrp |
wtf is that |
15:17
🔗
|
dnova |
someone's splinder account |
15:17
🔗
|
dnova |
up to 32mb now |
15:20
🔗
|
Ymgve |
underscor: how long do I have to watch before it gets amazing |
15:21
🔗
|
underscor |
like 2 mins |
15:21
🔗
|
underscor |
right after she says "I think he wants to fuck me" |
15:21
🔗
|
Ymgve |
ah, there |
15:21
🔗
|
emijrp |
and... ? |
15:23
🔗
|
dnova |
emijrp: ? |
15:24
🔗
|
emijrp |
she says that and what happens? |
15:24
🔗
|
dnova |
oh. I have no idea. |
15:24
🔗
|
emijrp |
I'm an archivist. |
15:24
🔗
|
emijrp |
I'm confused. |
15:24
🔗
|
dnova |
he archives the shit out of her |
15:24
🔗
|
emijrp |
WHAT. |
15:27
🔗
|
dnova |
emijrp, download some splinder |
15:27
🔗
|
dnova |
these last bunch of profiles are a real bitch |
15:27
🔗
|
emijrp |
If I download splinder, who the hell is going to download wikis? |
15:27
🔗
|
underscor |
not really |
15:28
🔗
|
underscor |
you have to watch it! |
15:28
🔗
|
underscor |
I don't want to spoil it |
15:35
🔗
|
Schbirid |
i downloaded ~14 gb splinder, some might be unfinished. cant continue. where to put it? |
15:36
🔗
|
dnova |
ask SketchCow for a slot and then use the upload-dld.sh script |
15:36
🔗
|
dnova |
upload-finished.sh rather |
15:44
🔗
|
Schbirid |
ok |
15:44
🔗
|
Schbirid |
SketchCow: i need a place to upload splinder downloads |
17:38
🔗
|
SketchCow |
You got it. |
17:40
🔗
|
SketchCow |
Has anyone else in here gotten calls/contact from reporters wanting to do an article on archive team? |
17:41
🔗
|
yipdw |
nope |
17:41
🔗
|
dnova |
me either |
17:44
🔗
|
SketchCow |
A fairly terrible article is going to come out, and I apologize in advance for it. |
17:44
🔗
|
dnova |
how did you find out about it |
17:44
🔗
|
SketchCow |
I did an interview for it. |
17:45
🔗
|
SketchCow |
I didn't realize who was writing it, he used an intermediary who did not identify him as the author, after I repeatedly refused to interact with him. |
17:45
🔗
|
SketchCow |
Now I found out and I have been yelling. |
17:45
🔗
|
SketchCow |
I'm good at yelling. |
17:45
🔗
|
underscor |
lol |
17:46
🔗
|
yipdw |
oh, was it Talmudge and/or Schwartz |
17:46
🔗
|
closure |
they're in your voicemail, archiving your archiving |
17:48
🔗
|
SketchCow |
Yes |
17:49
🔗
|
underscor |
yipdw: You know them? |
17:49
🔗
|
yipdw |
sneaky bastards |
17:49
🔗
|
yipdw |
not personally |
17:49
🔗
|
yipdw |
I do remember seeing Mattattattattattattattattahias Schwartz's article on Internet trolling a while back, though |
17:50
🔗
|
underscor |
hahaha |
17:52
🔗
|
yipdw |
I figure if he wants to invoke the pumping lemma on his name, it's fair game |
17:52
🔗
|
* |
closure listens to a 1 gb WD green sata drive fail to spin up in my external dock |
17:53
🔗
|
underscor |
:( |
17:53
🔗
|
closure |
wonder if it will do better on internal SATA.. will have to try later |
17:54
🔗
|
closure |
huh, on one dock it does nothing, on the other I can hear the motor fail to quite spin it |
17:55
🔗
|
SketchCow |
Or you're torturing it and it's randomly going up and down. |
17:56
🔗
|
dnova |
us? torture hard drives? |
17:56
🔗
|
dnova |
never! |
17:57
🔗
|
SketchCow |
Schbirid: Need the slot! |
18:01
🔗
|
closure |
oh good, everything on this drive is still present on some 50 or so dvds. urk. |
18:01
🔗
|
dnova |
haha |
18:29
🔗
|
dnova |
god damnit I just lost about 15 hours worth of downloading. |
18:29
🔗
|
SketchCow |
See, that's what you get |
18:29
🔗
|
SketchCow |
"ha ha you lost so much stOH FUCK I LOST MY STUFF"\ |
18:29
🔗
|
SketchCow |
Jesus did it to you |
18:30
🔗
|
SketchCow |
Jesus, he likes insta-parables these days |
18:30
🔗
|
dnova |
I wasn't laughing at closure!! well I kinda was. |
18:31
🔗
|
dnova |
argh! |
18:32
🔗
|
SketchCow |
Jesus knew |
18:35
🔗
|
underscor |
hahaha |
18:39
🔗
|
emijrp |
haha closure and dnova lost stuff |
18:40
🔗
|
emijrp |
i lost 1.5TB some months ago, so, it cant get worse |
18:42
🔗
|
emijrp |
obviously, unique stuff is currently being destroyed on the Internet; what is your estimate in megabytes?
18:43
🔗
|
emijrp |
mb/hour |
18:46
🔗
|
underscor |
Thank you for placing your order with the Comprehensive Large-Array data Stewardship System. |
18:52
🔗
|
dnova |
what's that? |
18:57
🔗
|
emijrp |
I sure do not know |
18:58
🔗
|
dnova |
haha |
18:58
🔗
|
dnova |
I lost pornoromantico :( |
18:58
🔗
|
dnova |
it was over 400mb |
18:58
🔗
|
SketchCow |
It's the Big Brother program for fat people |
19:00
🔗
|
underscor |
hahahah |
19:00
🔗
|
underscor |
It's NOAA's tape access system |
19:00
🔗
|
dnova |
you want noaa tapes? |
19:00
🔗
|
underscor |
I need a piece of historical data for oceanography class |
19:01
🔗
|
dnova |
awesome. |
19:07
🔗
|
emijrp |
Get your piece of oceanographic data http://en.wikipedia.org/wiki/Exploding_whale |
19:12
🔗
|
underscor |
I'm digging through noaa's various public FTP servers |
19:12
🔗
|
underscor |
There so much old cruft and stuff, it's really cool |
19:12
🔗
|
underscor |
TODO: Ring bob and tell him to actually upload the data here |
19:12
🔗
|
underscor |
This directory contains files related to the March 1993 Blizzard. It |
19:12
🔗
|
underscor |
includes a report on the storm and related data files described in |
19:12
🔗
|
underscor |
the report. NCDC's homepage provides easy access to this directory. |
19:13
🔗
|
underscor |
That file was updated 8/15/97 |
19:13
🔗
|
underscor |
and the data is still not there |
19:13
🔗
|
underscor |
hahaha |
19:13
🔗
|
dnova |
damnit, bob |
19:15
🔗
|
Schbirid |
damn, only 5 of 14gb were "finished" data |
19:15
🔗
|
bsmith093 |
noaa has public ftp archives ?! |
19:15
🔗
|
underscor |
ftp.ncdc.noaa.gov, ftp.ngdc.noaa.gov |
19:16
🔗
|
underscor |
ftp.nodc.noaa.gov |
19:18
🔗
|
emijrp |
If that FTP has been up since 1997, it is almost as trustworthy as the Internet Archive.
19:25
🔗
|
bsmith093 |
anyone want to catch me up on how we're archiving ffnet |
19:25
🔗
|
bsmith093 |
im also in #fanfriction |
19:29
🔗
|
godane |
looks like podtrac doesn't keep the ipod format of crankygeeks after episode number 100
19:29
🔗
|
godane |
likely mpeg4 is hosted by pcmag |
19:30
🔗
|
bsmith093 |
are u grabbing all their feeds, ausio ogg, etc |
19:30
🔗
|
bsmith093 |
*audio |
19:30
🔗
|
godane |
no |
19:30
🔗
|
godane |
mp3 is down |
19:31
🔗
|
godane |
can you guys please start mirroring crankygeeks |
19:31
🔗
|
godane |
i didn't think it was this bad yet |
19:31
🔗
|
bsmith093 |
when i checked only 4 mp3s were dead
19:32
🔗
|
godane |
its only 100-103 that are down |
19:32
🔗
|
godane |
ok |
20:09
🔗
|
SketchCow |
Anyone have an idea how to do a sed that keeps nothing BUT A-Za-z0-9?
20:09
🔗
|
PatC |
What's a sed? |
20:10
🔗
|
SketchCow |
Found it. sed 's/[^a-zA-Z0-9]//g' |
20:12
🔗
|
godane |
looks like all links to 101-103 are dead
20:12
🔗
|
godane |
for crankygeeks |
20:13
🔗
|
Schbirid |
SketchCow: alternatively grep -Eo '[a-zA-Z0-9]' might do what you want |
20:23
🔗
|
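Both do the job; one subtlety (a side note, not from the log) is that grep -Eo emits one match per line, so it needs a rejoin, while sed and tr edit the stream in place:

    sed 's/[^a-zA-Z0-9]//g'               # strip non-alphanumerics within each line
    tr -cd 'a-zA-Z0-9\n'                  # same idea, keeping newlines; often faster
    grep -Eo '[a-zA-Z0-9]' | tr -d '\n'   # grep variant, roughly equivalent after the rejoin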
SketchCow |
I would move ./Amiga Dream 01 - Nov 1993 - Page 32.jpg to AmigaDream01-Nov1993-Page32.jpg |
20:23
🔗
|
SketchCow |
I would move ./Amiga Dream 01 - Nov 1993 - Page 67.jpg to AmigaDream01-Nov1993-Page67.jpg |
20:23
🔗
|
SketchCow |
I would move ./Amiga Dream 01 - Nov 1993 - Page 22.jpg to AmigaDream01-Nov1993-Page22.jpg |
20:23
🔗
|
SketchCow |
I would move ./Amiga Dream 01 - Nov 1993 - Page 15.jpg to AmigaDream01-Nov1993-Page15.jpg |
20:23
🔗
|
SketchCow |
I would move ./Amiga Dream 01 - Nov 1993 - Page 45.jpg to AmigaDream01-Nov1993-Page45.jpg |
20:23
🔗
|
SketchCow |
Tah dah. |
20:25
🔗
|
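The dry-run output above keeps '-' and '.', so the real pattern presumably whitelists those as well; a reconstruction of the rename pass might look like this (a sketch, not SketchCow's actual script):

    for f in ./*.jpg; do
      n=$(basename "$f" | sed 's/[^a-zA-Z0-9.-]//g')
      echo "I would move $f to $n"
      # mv -- "$f" "$n"   # uncomment once the dry run looks right
    done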
emijrp |
SketchCow: did you read my suggestion for human ocr at IA ? |
20:25
🔗
|
SketchCow |
That you think the non-human OCR sucks and it should be replaced? |
20:25
🔗
|
SketchCow |
Was there more to it? |
20:26
🔗
|
emijrp |
replaced with a collaborative ocr; the technology for that exists
20:26
🔗
|
SketchCow |
I see. |
20:27
🔗
|
emijrp |
i think i read about IA saying their books have OCR for blind people |
20:27
🔗
|
emijrp |
but... checking the .txt files on scanned books, O_O |
20:32
🔗
|
emijrp |
IA has great potential, but most of its content is in useless formats
20:33
🔗
|
emijrp |
I dont want to read a book in JPG/DJVU, I want an epub or a correct txt |
20:33
🔗
|
SketchCow |
You strike at the heart of an endemic issue with IA. |
20:34
🔗
|
SketchCow |
I need to do a quick errand. |
20:34
🔗
|
SketchCow |
But it's a big issue and there may not be an easy solution. |
20:34
🔗
|
SketchCow |
It's a political technical issue |
20:46
🔗
|
emijrp |
From Wikipedia: Metapedia is a white nationalist and white supremacist,[2] extreme right-wing and multilingual online encyclopedia.[3][4][5] |
20:46
🔗
|
emijrp |
And it is a wiki. |
20:46
🔗
|
emijrp |
What is your opinion about archiving that? |
20:46
🔗
|
emijrp |
Discuss. |
20:47
🔗
|
dnova |
archive it |
20:48
🔗
|
Schbirid |
archive it |
20:54
🔗
|
emijrp |
Is it illegal to upload that into IA? |
20:55
🔗
|
emijrp |
I think it depends on servers location. |
20:55
🔗
|
dnova |
why would it be illegal? |
20:56
🔗
|
emijrp |
Speaking in a positive tone about Nazism is illegal in some jurisdictions.
21:21
🔗
|
SketchCow |
Download it. |
21:21
🔗
|
SketchCow |
archive it. |
21:51
🔗
|
arkhive |
http://arstechnica.com/gaming/news/2011/11/gamepro-magazine-and-website-to-shutter-next-month-1.ars |
21:51
🔗
|
chronomex |
emijrp: there is no law against naziism in the United States of America |
21:52
🔗
|
arkhive |
December 5th |
21:52
🔗
|
chronomex |
"Congress shall make no law ... abridging the freedom of speech, or of the press ..." |
22:44
🔗
|
godane |
i'm using wget-warc to backup crankygeeks web site |
23:07
🔗
|
bsmith093 |
for the ffnet archiving effort, where's the list of good id#s?
23:08
🔗
|
arkhive |
Will archiveteam backup GamePro, Waves on Google Wave or Knol? |
23:08
🔗
|
arkhive |
And did you guys backup Aardvark? |
23:09
🔗
|
godane |
is fanfiction.net going down? |
23:11
🔗
|
bsmith093 |
godane: no, preemptive
23:11
🔗
|
godane |
ok |
23:13
🔗
|
chronomex |
arkhive: I've heard some buzz about Knol. I haven't heard much about Wave. Are you interested in starting a subcommittee? |
23:13
🔗
|
godane |
can wget-warc sed out the main website? |
23:13
🔗
|
bsmith093 |
id like to help with knol, if theres a script, cause the button option sounds really, really slow
23:13
🔗
|
godane |
i want it to work locally and be possible to host it locally
23:15
🔗
|
chronomex |
godane: the purpose of WARC is to preserve as close to the source material as possible. as such, altering a page before storing it into the WARC file is to be avoided. wget can --convert-links, but I don't know how this affects the .warc output. |
23:15
🔗
|
alard |
chronomex: --convert-links is safe to use with warc. |
23:16
🔗
|
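So the two can be combined; an illustrative invocation (assuming a warc-enabled wget build), where the WARC keeps the raw responses as fetched and -k only rewrites the on-disk copies afterwards:

    wget --warc-file=site --mirror --page-requisites --convert-links http://example.com/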
emijrp |
arkhive: we have the metadata for knols, 700,000, now a scraper for the real content is needed |
23:17
🔗
|
arkhive |
chronomex: I'd like to help back Knol up. |
23:19
🔗
|
arkhive |
emijrp: can we write a script and have a server tell each connected client what knol to fetch. Like we did for backing up Google Video's Videos? |
23:20
🔗
|
arkhive |
I'm not too good on writing scripts though. |
23:22
🔗
|
emijrp |
im not sure, i dont know how people make those cool distributed scripts
23:22
🔗
|
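The usual shape for those is a tracker that hands out work items over HTTP; a bare-bones client loop might look like this (the endpoints and the knol URL layout here are entirely hypothetical, for illustration only):

    while :; do
      id=$(curl -s http://tracker.example.org/request)      # tracker hands out one knol ID
      [ -z "$id" ] && break                                 # empty answer: queue drained
      wget -q "http://knol.google.com/k/$id" -O "knol-$id.html"
      curl -s -d "id=$id" http://tracker.example.org/done   # report completion
    done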
chronomex |
alard: excellent. |
23:23
🔗
|
godane |
wget-warc is not converting links |
23:23
🔗
|
bsmith095 |
since everyone seems to be in here anyway, wheres the list of good id#s for ffnet |
23:23
🔗
|
godane |
i still get http://www.crankygeeks.com/favicon.ico |
23:23
🔗
|
godane |
like links |
23:23
🔗
|
bsmith095 |
-k |
23:23
🔗
|
godane |
i did |
23:23
🔗
|
godane |
index.html download |
23:24
🔗
|
godane |
and i still get those links |
23:24
🔗
|
bsmith095 |
wget-warc -mcpk |
23:25
🔗
|
emijrp |
arkhive: channel is #klol |
23:27
🔗
|
godane |
i'm still getting http://www.crankygeeks.com/favicon.ico links |
23:27
🔗
|
godane |
i also don't want it redownloading everything |
23:28
🔗
|
godane |
wget "http://www.crankygeeks.com/" --warc-file="crankygeeks" --no-warc-compression -mcpk |
23:28
🔗
|
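For reference, what those flags do (-m expands to -r -N -l inf --no-remove-listing):

    -c    continue partially-downloaded files
    -p    fetch page requisites (images, css)
    -k    convert links for local viewing; wget applies this only after the whole
          crawl ends, which is why pages still hold absolute links mid-download
    --no-warc-compression    write a plain .warc instead of .warc.gz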
godane |
thats what i used |
23:28
🔗
|
godane |
my wget-warc is wget |
23:29
🔗
|
arkhive |
GamePro's down December 5th...We should also start that. I've got storage and computers I can dedicate to backing it up. |
23:30
🔗
|
dnova |
how big is gampero |
23:30
🔗
|
dnova |
gamepro |
23:30
🔗
|
dnova |
and what is it |
23:31
🔗
|
arkhive |
A gaming blog, news, magazine website |
23:31
🔗
|
arkhive |
Not sure How big Gamepro's site is. Not sure how to check either. |
23:31
🔗
|
emijrp |
delicious was archived? |
23:32
🔗
|
alard |
godane: Be aware that compressing warcs is preferably done while wget is downloading, not with a post-processing gzip step. So if you do intend to gzip later, it's better to remove the --no-warc-compression |
23:33
🔗
|
godane |
alard: i want to make sure i don't get the same problem i have now |
23:33
🔗
|
godane |
if i compress i will not know if it worked correctly
23:34
🔗
|
alard |
You can gunzip? |
23:34
🔗
|
godane |
not until i'm don't download |
23:34
🔗
|
godane |
*done |
23:34
🔗
|
yipdw |
hmm |
23:34
🔗
|
yipdw |
guess I'll snag a copy of http://wikileaks.org/the-spyfiles.html |
23:35
🔗
|
godane |
i'm not downloading 200mb website to find out it didn't convert the links |
23:37
🔗
|
emijrp |
yipdw: thanks, now this channel is being monitored by CIA, NSA and ETs. |
23:37
🔗
|
bsmith095 |
ET? really, cool :) |
23:37
🔗
|
yipdw |
emijrp: bomb, bin Laden, airplane |
23:37
🔗
|
alard |
godane: You should do what you think is best, of course, but: 1. you can gunzip the warc.gz while downloading (it'll just print an error at the end, but you will see what's been downloaded so far) 2. bear in mind that the wget link conversion is always done at the end of the download, if I remember correctly. |
23:37
🔗
|
balrog_ |
oh |
23:37
🔗
|
bsmith095 |
correct |
23:37
🔗
|
balrog_ |
anyone know of tools for scraping MediaWiki sites? |
23:38
🔗
|
alard |
godane: just do gunzip -c my.warc.gz | less |
23:38
🔗
|
emijrp |
balrog: I call them WikiTeam tools. |
23:38
🔗
|
chronomex |
balrog: wikiteam produced some good tools. look in the archiveteam wiki |
23:39
🔗
|
godane |
alard: i may never finish downloading cause it keeps downloading blank previous pages
23:39
🔗
|
balrog |
awesome. thanks! |
23:40
🔗
|
godane |
just to go back to the last 15 episodes it's at over 1500, even when there are only 237 episodes
23:40
🔗
|
emijrp |
balrog: which wiki? |
23:40
🔗
|
balrog |
none that Archive Team would be interested in. it's not going away but I need a backup for various purposes |
23:40
🔗
|
balrog |
ATTENTION: This wiki does not allow some parameters in Special:Export, so, pages with large histories may be truncated |
23:40
🔗
|
balrog |
any way to get around that? |
23:42
🔗
|
emijrp |
only affects pages with 1000+ revisions
23:42
🔗
|
emijrp |
not the common case |
23:42
🔗
|
balrog |
ahh, yeah shouldn't have any of those here at all. |
23:42
🔗
|
chronomex |
balrog: I think you'll have to upgrade the wiki to fix that, but it's rare to cause problems. |
23:42
🔗
|
balrog |
:) |
23:42
🔗
|
balrog |
it's not a wiki I have access to. |
23:43
🔗
|
balrog |
but yeah shouldn't have +1000-rev pages. |
23:43
🔗
|
bsmith095 |
underscor: do u have the list of valid ids for ffnet |
23:43
🔗
|
godane |
i really HATE mirroring websites now
23:44
🔗
|
godane |
they just always keep remirroring the full site when i just want the update crap
23:44
🔗
|
godane |
like xkcd.com |
23:44
🔗
|
godane |
httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo -n --disable-security-limits -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 --update xkcd.com |
23:45
🔗
|
godane |
thats what i did |
23:45
🔗
|
bsmith095 |
i have an xkcd images script if anyone wants that
23:45
🔗
|
godane |
i thought the --update will not fucking redownload files |
23:46
🔗
|
bsmith095 |
why backup xkcd? |
23:46
🔗
|
bsmith095 |
the whole thing? |
23:46
🔗
|
godane |
again |
23:46
🔗
|
godane |
i want to locally host things on my local lan
23:46
🔗
|
bsmith095 |
wget-warc -mcpk --random-wait xkcd.com |
23:47
🔗
|
godane |
but that will not download imgs.xkcd.com i think |
23:47
🔗
|
emijrp |
http://arxiv.org/ is a great site but they have counter-archivist measures
23:52
🔗
|
godane |
looks like stupid imgs.xkcd.com can't be mirrored with wget-warc |
23:53
🔗
|
bsmith093 |
if u want imgs.xkcd.com, google this: yaxkcdds.sh
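A closing note on the imgs.xkcd.com problem: wget stays on the starting host unless told to span hosts, so a domain whitelist is the usual workaround (a sketch, assuming a warc-enabled wget):

    wget --warc-file=xkcd -mcpk -H -D xkcd.com http://xkcd.com/
    # -H spans hosts; -D xkcd.com limits the span to *.xkcd.com, which covers imgs.xkcd.com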