[00:00] let's see. error 7 is failure to connect to the host, and 56 is failure in receiving network data [00:01] xargs runs curl as many times as insatnce is set for, but the instant any one of them hit an error they all are left to finish up then the sript quits [00:01] right [00:01] that's how xargs works [00:02] oh ok, so is therea way to print an error and keep xarge going [00:03] http://tracker.archive.org/ff.net/numbers this file has the full list of ids to check, thats that the curl func does, xargs check multiple at once, or at least its supposed to [00:04] if you're getting error 7 or 56 "seemingly at random", it may be fanfiction.net [00:04] also, that script you posted is kind of nuts [00:05] i agree, not my scriot but the best arrith could do on short notice [00:05] you're creating a subshell and are evaluating a function and then executing it for every job [00:05] yipdw: but why is it nuts [00:05] there's a simpler way to do it :P [00:05] one moment [00:05] yay, so whats the better way [00:06] well, you could put the contents of the function into another sh file, and call it that way :) [00:06] the numbers, cause they are [00:06] I'm still not quite sure why there even needs to be another function [00:06] but that's a minor thing [00:07] this is why i gave the link, so you all could tweak and offer suggestions :P [00:13] oh, I see why [00:13] I guess xargs won't execute a shell function [00:13] no [00:14] bsmith093: out of curiosity, do you get the curl error if you run with instance_count=1 [00:14] not sure hold on [00:15] also, is it always on the same IDs [00:16] finally, I'm wondering if it would be more efficient to do this via spidering the fanfiction.net story indices [00:16] 76.2 megabytes of IDs is a lot [00:16] ? [00:16] how many of those IDs actually reference stories [00:16] probably several million [00:16] and thers only 10 mil max so why not be methodical [00:18] appears not [00:18] finally, I don't think that script actually archives stories in full [00:19] inst count =1 no problem, but its not in paralell, so it kinda defeats the puroose [00:19] purpose [00:19] how does it behave on stories with multiple chapters, e.g. http://www.fanfiction.net/s/5909536/? [00:19] wait just crashed yep same error [00:20] it doesn't actually grab them just checks if the id is valid [00:20] im building a linklist [00:21] hm [00:21] I maintain it would be more efficient (for you and for them) to start at the roots on http://www.fanfiction.net/ [00:21] and trace from there [00:21] by using WWW::Mechanize/Mechanize/etc. [00:21] I've got to run, though, so I can't provide an example [00:21] maybe later [00:22] usage of those tools does mean leaving bash and using Perl, Python, Ruby, or whatnot, but IMO those are better languages for this sort of stuff anyway [00:22] bbl for real [00:26] connection dropped out, what'd i miss [00:41] Nothing [00:42] how do i get wget --spider to give up a linklist fort he ehole site [00:42] Dunno off the top of my head [00:43] On a side note, I'm almost to 1,500,000 [00:43] Simply using this [00:43] for i in `cat numbers_[a-e][n-z] `;do var=`curl -A "ArchiveTeam/1.0 - Email archiveteam@k-srv.info for misbehavior or complaints" -I http://www.fanfiction.net/s/$i|grep Last`;echo -n "$i - ";if [ -z $var ]; then echo "Not a story";else echo "Story";echo $i>>stories_aa;fi;done [01:01] k-srv.info, who is that? [01:07] yipdw: took me like an hour to work out that xargs subshell thing. seriously. [01:08] yipdw: they want you to put stuff into another script then have xargs in your original script run *that* [01:08] yipdw: and i am quite proud of my (crazy) workaround :D [01:12] yipdw: btw underscor is doing his own script that's more thorough, what i'm doing is just a dirty/fast grab for the stories as really just a proof of concept. [01:20] hey im running underscor's thing with his files from ffnet tracker, and its picking up where he left off [01:21] still 81 days though [01:22] bsmith093: What do you mean where I left off? [01:22] the file stories aa is growning [01:22] growing [01:24] I know, I'm saying I didn't leave off anywhere [01:25] bsmith093: afaik that's a snapshot of his work, he's probably farther along than that [01:25] well ok then [01:25] oh, yeah [01:25] sorry [01:25] my bad [01:26] http://tracker.archive.org/ff.net/stories_0-1299999 [01:26] That might be of interest though [01:26] Those are all the valid ones [02:16] underscor: im running your script now, since yoru so much further ahead than me, and it keeps failing Running storyinator on id 0000004 Let's get some metadata. Frontpage Gotten Title is Little Helper Writen by Sheryl Nantus, whose userid is 3284 Placed in tv>>X-Files Tags are Rated: K+, English, F. Mulder & D. Scully, P:3-16-99 Published 3-16-99, updated Story has 38 reviews, which is 3 pages chapters in this story Making dir [02:17] That all looks correct [02:17] Do you have php [02:17] and do you have the php file xmlr.php? [02:17] yeah about, that could not open imput file xmlr.php [02:18] Did you download it? [02:18] :P [02:18] yes [02:19] man php says yes i do have php, but maybe not the right version or something [02:20] bsmith094: what's wrong with my script :( [02:21] still running arrith [02:21] it gets the job done of finding IDs. and in parallel! [02:21] oh [02:21] bsmith094: looks like it's working? [02:21] apparently [02:21] using underscors numbers because ho's got so many [02:24] now as for the actual downloading of the stories, well thats more complicated, according to whatever black arts this this is using http://pastebin.com/e5e4tvK5 [02:39] bsmith094: Unknown Paste ID! [04:18] arrith: I see [04:19] arrith: at that point, in my opinion, it probably is clearer to switch to a different programming language and use e.g. a thread pool [04:22] psssshh [04:23] yeah. i have a very small and light and elegant reimplementation of that in python written by a friend of mine using a threadpool even but eh, i don't know python yet [04:23] or, more appropriately [04:23] a queue [04:23] thread pool being an implementation detail, obviously :P [04:24] http://paste.pocoo.org/show/516501/ [04:25] ah, i don't think i know the difference between a pool and a queue [04:25] yeah, pretty much [04:25] they're different structures, not directly related [04:26] the idea being that you throw all of your tasks (IDs, in this case) into a queue, and then there exist multiple executors that dequeue a task, work on it, and then check it in [04:26] the thread pool is a way to limit the number of concurrent executors [04:26] ah [04:26] does python's multiprocess use a queue? [04:27] the map() function probably does [04:28] ah [04:36] im back, is the python code just an example? [04:37] anyway, now im trying to get underscor's storyinator.sh to work [04:49] i have 193 episodes of crankygeeks now [04:50] i also have all crankygeeks episodes posts [04:50] publicly availible yet [04:50] i have not uploaded anything yet [04:51] i have backed them up on dvd [04:51] i have md5sum file for making sure data is right [04:51] how many dvds [04:51] 3 so far [04:52] it will be at least 6 dvds when fully done [04:52] nice. that's hardly anything [04:52] wow [04:52] *6 single layer dvds [04:53] this is only one format [04:53] there's no reason to get the other formats [04:53] if you're getting the best one [04:53] i'm getting ipod one [04:53] the smaller ones can be recreated from those if necessary. [04:53] wait... really? [04:53] so mp3 [04:53] oh they're just podcasts? [04:53] I thought they were videos [04:54] there just podcasts [04:54] that should be fine then. [04:54] but its video podcasts [04:54] ... uh [04:54] ok... well, mp3 has no video [04:54] video with audio podcasts [04:55] you should be getting the highest quality version of them [04:55] i did for the first 70 [04:55] mpeg4 would have had to change to quicktime [04:56] cause mpeg4 became the ipod format [04:57] i thought mpeg4 was basically quicktime [04:58] there is .mp4 files then there is .mov files [05:00] mp4 is independent of quicktime [05:01] anyways the videos are not that big [05:01] backing up something is better then nothing [05:01] the isomedia mp4 container format is largely based on the quicktime container format. (note, quicktime is like AVI in this regard: both can contain a number of different codecs. the mp4 container is a bit more limited) [05:02] you don't need to backup the 1.6gb 720p videos of podcasts [05:02] godane: says who? [05:02] I guess if it's not that important to you, that's fine [05:03] I don't know anything about that podcast and I am personally not too concerned about it [05:03] but if it's worth doing, it's worth doing right, isn't it? [05:03] it takes along time to download and upload 1.6gb file [05:03] what's the deadline? [05:04] also crankygeeks doesn't have HD [05:04] the biggest file is like 140ishmb [05:04] so where did 1.6gb come from? [05:04] get the 140mb or whatever the best quality files are [05:05] i'm just getting the ipod one [05:05] sorry [05:05] I don't give a shit, but it sounds like you do [05:05] but only enough to half-ass it [05:05] i watch the videos [05:05] the is not big differents between the too [05:05] maybe they'll upload them to youtube for near-term preservation and availability [05:05] *twno [05:06] dnova: hey, i don't particularly care for most of the stuff on ffnet, either, but im still saving them preemptively, cause it would really suck if that much creativity went into the bitbucket [05:07] right [05:07] and I don't care about splinder on a personal level either [05:07] nor me [05:07] but I've spent lots of time and a decent amount of money to grab as much as I can [05:07] and I'm still grabbing. [05:07] at all, but as long as it was that easy to help out, i did' [05:08] hell yeah man! [05:08] they did a bang-up job with that. [05:08] now, ffnet, for being a fully automated site, is a pita to grab, all of it any way, pinging the ids just to see which urls are valid will take about a month [05:09] I will help with that if possible. just let me know. [05:09] I'm still not sure why you're going through all the IDs [05:09] it's not a huge time crunch with that project so don't stress too much about it [05:09] have you identified some problem with using the fanfiction.net indices? [05:09] e.g. have they blocked some stories from showing up? [05:09] we dont really have a script yet, weve, * and i mean underscor and arrith , have some tentative efforts going [05:10] yeah.. when it's a little more fleshed out I'll throw some hardware at it [05:10] yipdw: uhh no, i just want to save them bc thats a lot of work to dissappear [05:10] that's not the question I asked [05:10] he means why are you brute forcing the ID list [05:10] yes [05:10] fanfiction.net has, from what I can see, a perfectly usable story index [05:10] oh well , umm, thats the easiest way ive found [05:10] * yipdw sighs [05:10] where? [05:11] their web page [05:11] one moment [05:11] I have some time now, let me whip up a demo [05:11] we could scrape the feed, butthat goes forward not back [05:11] no no [05:11] I mean the page itself [05:11] e.g. http://www.fanfiction.net/play/ [05:11] ok then whip away, cause u lost me [05:11] every story is linked from these lists [05:11] (as far as I can tell) [05:12] if you have some counterexamples, I would like to hear them [05:12] ummm, but its easier just to grab the story ids directly [05:13] all u need then is to find out how many chapters each on is, and thats on the first page of each one [05:13] the initial implementation is easier, but: [05:13] (1) it requires a pre-filtering step that (you say) will take a month [05:13] (2) it's really inconsiderate [05:13] (fanfiction.net isn't dying) [05:13] heh [05:13] and the point of archiveteam, as far as I know, is to archive, not be assholes [05:14] the latter sometimes happens but not as an objective [05:14] to be fair, he's not trying to be an asshole at all [05:14] bear in mind that every GET for a story ID likely requires a database lookup, unless ff.net has done some caching along those lines [05:14] I know he's not [05:14] I'm just saying that brute-forcing is a pretty inconsiderate way to do thngs [05:14] which is also why I'm writing up an alternative [05:14] im just using curl to scrape the head of the urls [05:16] yipdw: so what's your alternative [05:16] you have a set of roots, right [05:16] yeah [05:16] namely, the fanfiction categories on the main page [05:16] ok [05:16] with u so far [05:16] each root contains a set of categories [05:16] each category contains a set of stories [05:16] therefore, there is no need to test each ID [05:17] and you can begin archiving stories immediately [05:17] uhuh, a so what wget --spider -m? [05:17] I don't know what tool you'll use; I'm writing a tool in Ruby at the moment [05:30] whoa my rubinius install is out of date [05:30] time to update [05:34] update it good [05:41] so anyway, wget -m witha ua changed to firefox seems to be saving the same links tructure as well, so no resorting of ids back into categories [05:47] did you apply a wait time (possibly with the random wait options as well?) [05:52] ok [05:53] https://gist.github.com/1432483 [05:53] that doesn't actually save anything yet, but it can be extended to do so [05:53] the idea is to demonstrate a more targeted approach [05:54] if you run that (use Ruby 1.9.3, JRuby in 1.9 mode, or Rubinius 2.0.0 in 1.9 mode) you'll see how it works [05:54] attaching an example run log now [05:54] attached [05:54] note, too, that paginated categories are treated as just more categories [05:55] there's some deduplication work to be done there, but [05:55] eh [05:56] one possibility for saving with the script I linked is to save each story as its own WARC, reviews and all; that'd eliminate the need for a separate review queue [05:56] that assumes that the unit of work you want to save is the story [05:56] which I think is true. [05:57] yipdw: That's pretty spiffy! [05:58] I think it's probably buggy [05:58] there are some duplicate names showing up; the link selection logic probably needs to be refined [05:58] but that's the idea [05:58] as a bonus, the number of instances you run can be carefully controlled by simply changing the size of the connection pool [06:01] aha [06:02] underscor: i was going to ping you to make sure you saw this discussion, yeah some interesting stuff [06:02] oh oops [06:02] my category-detection scheme fails on crossovers [06:08] heh, that's annoying [06:08] http://www.fanfiction.net/crossovers/movie/ has broken HTML [06:09] yipdw: well, its official, your ruby kicks my wget's ass [06:09] probably more efficient, too [06:10] keep in mind that this code doesn't actually save anything yet [06:10] I'm not sure how you want to do that [06:10] im fine with category/show/userid/story [06:11] and you have a repo, which makes updating SO much easier [06:11] also, I'm not sure how hard it would be to get wget-warc to do this [06:11] (haven't tried) [06:11] there are advantages to using that, such as making it easier to replicate fanfiction.net's structure [06:11] i still don't get why warc is important? [06:11] why did we have to compile wget-warc for splinder? [06:12] dnova: there's no official release of wget + WARC capabilities [06:12] because the warc features are not in most distro's package repos yet [06:12] this code would work for fictionpress as well, since they're identical [06:12] bsmith094: I think it's important to capture not only the story data but also the circumstances under which the capture was done [06:12] the warc features have been accepted into wget's mainline, however [06:12] WARC provides that [06:12] ah, interesting. [06:12] huh, well ok then [06:13] also, IA is set up to ingest WARCs, I think [06:13] yes, the wayback is set up to ingest warc pretty much automatically (once someone feeds the warc to it) [06:14] so something like Books/Harry Potter/1234567/2345678/blah.html [06:14] so do you just want to archive the text of the stories? [06:14] or are you after more than that? [06:14] ok its late or early, so gnight yall [06:14] because if it's just text, fanfiction.net's mobile site is actually better suited for this [06:14] (it's simpler) [06:15] yipdw: he's after just the text. I'd prefer a full warc set [06:15] Coderjoe: full WARC set of all stories, one story per WARC? [06:15] or a WARC archive of the whole site [06:15] well, IIRC, he wants the text, author comments, and reviews [06:15] ok [06:15] a warc for the entire site would require LOTS of ram, I think [06:15] dnova: yeah [06:16] I guess what I should be asking is [06:16] actually that would be a great bonus but ill take jus the stories if that all i can grab [06:16] what's the objective here [06:17] is the idea to take e.g. http://www.fanfiction.net/s/6635497/1/Plotting_The_Unknown_Future and wrap it into a WARC, comments, reviews and all? [06:17] for ingestion into IA? [06:18] anyway, I'll clean up that ff Ruby code and dump it into an AT repo on github [06:18] that loop { sleep 5 } bullshit needs to go [06:19] PSA: if anyone is doing sleeps like that in threads and you're not waiting on a periodic source, you have sinned [06:20] I'll take your word for it [06:20] arguably sleeping on periodic sources is a bad idea anyway [06:21] er, as a wait for [06:21] not a huge thing, and i feel like a jerk since i cant code wotrh a damn, but it would be just fantastic, if you could put the author profile page in there somewhere, as well as the reviews for each story as html, with the story [06:21] yipdw: underscor is kinda leading the design on that [06:22] arrith: cool [06:22] again, this Ruby stuff is just a PoC [06:22] i'm not sure what he's including but i'm hoping as much as possible [06:22] feel free to use or not use as needed [06:22] once he's done with his bash+php+perl thing i want to look over it and try to convert it to python as much as possible, make sure it's getting everything comprehensively enough, then integrate it with the universal tracker for periodic scrapes [06:23] true, that why i feel like a jerk, im throwing out ideas, that equal more work for the rest of you guys, and i cant really contribute anything, but bandwidth to run whatever scripts you finally come up with [06:23] alright. i don't know ruby but it looks pretty neat. i'll try to make sense of it [06:23] Before I forget yet again, SketchCow, can I have an rsync slot? I've got some berlios and a wayward chunk of Google Groups. [06:23] bsmith094: you could glance over a python tutorial :P [06:23] arrith: Ruby is perl in a dress. [06:23] bsmith094: http://learnpythonthehardway.org/ [06:23] arrith: do you know a good one? [06:23] beat me. [06:23] dnova: http://learnpythonthehardway.org/ [06:23] LOL [06:23] arrith: it starts at the roots -- the major subdivisions of the site [06:23] OK, one moment. [06:23] thanks :P [06:24] arrith: each root is thrown into the discovery queue, which generates more categories or story URLs [06:24] dnova: that one and How To Think Like A Computer Scientist [06:24] arrith: from there, categories are sent to the discovery queue, story URLs are sent to the grab queue [06:24] that, ust now, was more activity in 2 min, than this feed hashad in a week [06:24] arrith: there's four executors for each queue, and four HTTP connections shared amongst all queues [06:25] it's similar in structure to what one might do with the multiprocessing package in python [06:25] just different names. [06:25] this looks great, I'm going to check it out, thanks arrith. [06:25] dnova: good :) [06:25] yipdw: hmm yeah i'm hoping that's not too difficult to translate into python [06:25] arrith: it shouldn't be, Python has much of the same tools [06:26] one second [06:26] updating support.rb with smarter logic [06:26] Wyatt|Wor: sounds about right [06:31] while were all here, has anyone checked out storyinator.sh from here, www.tracker.archive.org/ffnet [06:32] alrighty [06:32] https://gist.github.com/1432483/cdbfa4c8e9779e009838235da543fc0a08754862 [06:32] Oh? Hey now [06:32] bsmith094: I'm getting 404 [06:33] Or not even 404 [06:33] http://tracker.archive.org/ff.net [06:33] wrong link [06:33] AH [06:34] Wyatt|Wor: that's a bit of underscor's work so far [06:34] yeah [06:34] yeah i know [06:34] mk [06:35] just he's further along and it's a non portable proof of concept atm [06:37] yipdw: do you generally prefer ruby to python for quick projects? [06:37] non....portable? But it runs on anything with a bash interpreter... [06:37] ;) [06:38] arrith: I've used Ruby more recently [06:38] so I find it easier to express programs in it [06:38] I have nothing against Python, though; I usually use it to script Blender [06:38] no complaints about Python there [06:38] Wyatt|Wor: heh well, requires php. a novice user getting php up and running for a small script isn't the easiest [06:39] i have a layout idea for what to grab for the stories http://pastebin.com/W6tUR1VE [06:39] yipdw: ah, i was wondering if you had experience with both or just knew ruby more [06:39] arrith: both :P [06:40] yipdw: that's exactly why i'm writing things in bash and not python ;) [06:40] eh? [06:40] well [06:40] here's my problem with bash [06:40] the language is arcane as hell, it's not really THAT portable due to lots of differences between shell versions [06:40] and even if you have the same version, the installed utilities can differ [06:40] GNU du does not accept the same options as e.g. BSD du, for instance [06:40] so you end up coding abstractions for stuff like that [06:41] oh yeah i have no defense for any of that [06:41] in the end I've found Python, Perl, Ruby to be more portable than bash :P [06:41] yeah definitely [06:41] i blame it on being 'raised wrong'. it's all i know! [06:41] for now at least [06:41] I love bash for the beauty that comes from some of the ugliest code on the planet. [06:42] ditto [06:42] i can actually follow most of it [06:42] But I'm not going to pretend it's more than glue. [06:42] Moreso than perl, even. [06:43] i've had an unfortunate feedback loop of mainly knowing bash, so i start a project in it then google to fill in areas that i lack, and not just starting over doing the hard very beginning stuff with a new lang [06:43] for my next project, I'll get ArchiveTeam using Factor [06:43] Why use bash when you can use ksh? ;) [06:43] http://factorcode.org/ [06:44] no2p: Actually people ask me this seriously here. I even developed an answer for it: Because Bash is everywhere. [06:44] Oh, no doubt. I was joking in terms of 'looks'. [06:44] so is Java, but that hasn't helped much :P [06:44] well, that's unfair [06:45] in a server context it's fine [06:45] yipdw: But Java is a boilerplate language not a glue language [06:45] re: portability [06:45] Wyatt|Wor: I don't understand the distinction [06:45] hey, i love java for its ubiquity [06:45] geh [06:46] yipdw: Java you spend most of your time writing long strings of boilerplate code. Bash, you spend a lot of time gluing other things together until it does what you want. [06:46] not saying its good, or fast, but a jar will run on anything with a jvm [06:46] stop with all the esoteric languages. for the distributed downloading stuff, it should use a well-featured and widly-installed language (like python or perl) [06:46] I wasn't serious about Factor [06:46] nor me with java [06:46] yipdw: AT needs easier code not harder code :P [06:47] **shudder** [06:47]