[00:14] alard: It makes me nervous when you say a site is large.
[00:21] a political song if you like http://blog.greenpirate.org/voters-lament-song-debut-from-grandpa-matt/
[00:21] nigh-partisan
[01:44] alard: do you delete projects from the tracker database when they're done?
[01:44] that makes the long-term graphs disappear :(
[01:49] Maybe we need to think about a way to "freeze"
[01:54] Alright, back.
[03:07] Tracker rate limiting is in effect? Seems like some of my warrior threads can't get work.
[03:08] remember this is probably a single host we're hammering.
[03:09] Their forums are hosted on a single host? Aw.
[03:09] Well, as long as it is intentional limiting
[03:10] I don't actually know
[03:52] it's intentional
[03:57] @alard I don't mind some DIY, I can always try. What exactly are the non-thread pages? Do you mean the forumdisplay and /archive stuff?
[04:37] http://blog.greenpirate.org/hugo-awards-censored-by-copyright-enforcement-ai/
[04:37] how copyright will destroy us all^
[05:42] http://blog.greenpirate.org/lolnews/
[06:54] illunatic: off-topic. use #archiveteam-bs for that, please.
[07:13] S[h]O[r]T: Yes, the /index.php, the /forumdisplay.php pages and the announcements. You could use the wget-lua script for that, I think. Comment out the lines that lead from forumdisplay to the threads, then run wget on the list of URLs: /index.php + /forumdisplay.php?f= with each of the forum/subforum IDs.
[07:14] With page-requisites, but without mirror.
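A minimal sketch of the DIY grab described at 07:13-07:14, assuming a WARC-capable wget and a hypothetical forum_ids.txt listing the forum/subforum IDs; the real cityofheroes-grab pipeline uses its own wget-lua script, so treat this only as an illustration of the "page requisites, no recursive mirror" idea:

    # Sketch only: build a URL list of the index page plus every
    # forumdisplay page, then fetch them with page requisites but
    # without recursive mirroring. forum_ids.txt is a hypothetical
    # file with one forum/subforum ID per line.
    echo "http://boards.cityofheroes.com/index.php" > urls.txt
    while read -r fid; do
        echo "http://boards.cityofheroes.com/forumdisplay.php?f=${fid}"
    done < forum_ids.txt >> urls.txt

    wget --page-requisites \
         --warc-file=coh-forumdisplay \
         --input-file=urls.txt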
[07:15] I hadn't seen the archive: http://boards.cityofheroes.com/archive/index.php/ These are just copies of what's on the real forum, right?
[07:15] check for the purged ones maybe?
[07:16] chronomex: I usually remove the finished projects, yes. I download the log file and clear the memory.
[07:16] hrm
[07:16] I might have to change my graphing adapter thing then
[07:17] Oh whoa.
[07:17] alard, the archive looks like it's from 2004 o_o
[07:17] I never saw this.
[07:17] Wait--no.
[07:18] It's ordered from oldest first to newest last.
[07:18] chronomex: Sorry, I didn't know about that. I thought it just remembered the old values. Still, it's probably a good idea to remove the old data from Redis.
[07:18] http://boards.cityofheroes.com/archive/index.php/t-296680.html <- This was posted on the forums within the past few hours.
[07:18] oh, it remembers the values in the RRD files, but they don't render unless the script outputs the name, which it gets from the database
[07:19] Aragan: It could be a search engine thing, for search engines that don't like query strings.
[07:19] Every page has a link back to the 'full version' too.
[07:21] I don't think there's anything with IDs under 100,000.
[09:33] looks to be another 3000 URLs and the theblaze.com stories will be backed up
[10:45] 1100 URLs to go
[10:48] I think forum posts on archive.org should be locked after a year, or you get spam: http://archive.org/post/339794/merry-christmas-everybody
[12:34] I got all the stories from theblaze.com
[12:35] now I'm going to look at getting all the theblaze.com/wp-content/uploads/ files
[12:35] which are images, PDFs, maybe ZIPs and HTML pages
[13:06] I'm getting everything in the theblaze.com/wp-content/ folder
[14:46] Coderjoe: sure
[15:12] the images from theblaze are very big
[15:12] I have 18 warc.gz files so far at about 100 MB
[15:48] Swizzle: whoah, lotta games going in now
[15:48] should they all have the _1020 suffix?
[15:49] DFJustin: Yea - Nemo_bis showed me https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader so I'm burning through the collection. Will need to do a QA pass on a bunch to add descriptions, but most of the content is going up automatically at least
[15:50] Yea, I was lazy and "randomized" the IDs by adding _1020 to the end of them
[15:50] hehe
[15:50] I found that if the ID matched one already in the database the uploader would go crazy on me
[15:51] hmm, this shouldn't happen
[15:51] btw, something you may want to enable on the collection: adding a property called show_search_by_year set to "true" gives you a "browse by year" link
[15:54] the s3 uploader only sets the date field and not the year though, unless you specify both :(
[15:54] Awesome! I've gone ahead and added it. Although I only specified the date field. Does that search only the year field?
[15:55] yeah
[15:56] Ouch. Well, at least I can change the CSV for the remaining files. I will need to fix up what I did last night
[15:57] just doing an edit/save with the web interface sets the year field from the date automatically
[15:58] they just haven't hooked that up on the s3 side
[15:58] Yea, I've noticed that before. I've never understood why they have both fields, so I just chose one when I did my CSV file. I'm kicking myself now for not including both
[16:02] you can bulk-fix using metamgr.php; should probably move discussion to #archiveteam-bs though, as it's a little off-topic
[16:40] FYI: I'm going to remove the 'textfiles' query from the #archiveteam-twitter channel.
[16:57] swebb (or anyone else who manages the various batsignals): There's a new project on the warrior.
[16:58] What's the project?
[16:58] http://tracker.archiveteam.org/cityofheroes/
[16:58] http://boards.cityofheroes.com/
[16:58] It may or may not disappear sooner or later.
[17:00] There's a bit of "realignment of company focus" and "celebrating the legacy" going on.
[17:00] is it doing --page-requisites grabs to get offsite images?
[17:00] DFJustin: Certainly.
[17:01] Here's the full Wget command line: https://github.com/ArchiveTeam/cityofheroes-grab/blob/master/pipeline.py#L75-93
[17:03] :D
[17:11] Maybe we should also make a copy of http://www.cityofheroes.com/, and perhaps even the fan sites: http://na.cityofheroes.com/en/community/fan_sites/fan_sites.php
[17:17] Looks like the rsync server may be having some problems. I'm getting stalls when uploading using the warrior.
[17:24] NM. Better now
[17:56] does someone have some bash snippet to turn any string (e.g. a filename) into an archive.org item name compatible string? replacing spaces with underscores etc.
[17:57] yessir
[17:57] gimmeh
[17:58] https://gist.github.com/3391205
[17:58] note, it also does uppercase to lowercase
[17:58] which isn't required
[17:58] but you can edit that out if you want
[17:58] perfect, thanks
[17:59] np
[19:04] on a local freecycle list, "30+ old cds (mostly 90's software), ...", is that something that interests you folk? also "a bag of floppies" (no indication what's on those)
[19:08] Famicoman: tr -d '[{}(),\!:?~@#$%^&*()+=;<>|]' <- has () two times, oversight?
[19:12] berndj: sure is
[19:12] SketchCow is the one usually accumulating that kind of thing, but he's probably afk enjoying Manchester
[19:13] total oversight, and I bet some symbols in there aren't filename-worthy
[19:17] ok
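In the same spirit as the gist linked at 17:58 (not its exact contents), a minimal sanitizer might look like the sketch below: lowercase, spaces to underscores, and the stray symbols stripped, with the duplicated "()" spotted at 19:08 listed only once:

    # Sketch only: normalise an arbitrary string into something closer to an
    # archive.org-friendly item identifier. Lowercasing is optional, as noted above.
    sanitize_id() {
        printf '%s' "$1" \
            | tr 'A-Z' 'a-z' \
            | tr ' ' '_' \
            | tr -d '[{}(),!:?~@#$%^&*+=;<>|]'
    }

    sanitize_id "Some File (final) #2.zip"   # -> some_file_final_2.zip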
[19:29] does IA decide on deriving based on the user-specified mediatype? or does it check what uploaded files actually are?
[19:31] afaik the mediatype doesn't influence the derive, just the presentation of the page and which collections it shows up in
[19:32] ok
[19:37] I also have accumulated old media, but it is mostly from my own flotsam
[20:26] Schbirid: #internetarchive :)
[20:26] seriously?
[20:26] underscor started a support hang about
[20:26] ya
[20:26] I mean, I'm not saying you're off topic
[20:26] sigh
[20:27] it's also for people who aren't associated with archiveteam
[20:27] however unlikely that might be
[20:27] ok
[20:27] imo it's a great idea ^_^
[20:37] <3
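As a footnote to the date/year discussion around 15:54-15:58: assuming the standard IA S3-style upload interface with x-archive-meta-* headers, both fields could be set explicitly at upload time instead of relying on the web interface to backfill the year. The identifier, filename, values and keys below are placeholders, not taken from the actual collection:

    # Sketch only: upload a file and set both date and year metadata via the
    # IA S3-like API. Identifier, filename, dates and keys are placeholders.
    curl --location \
         --header "authorization: LOW ${IA_ACCESS_KEY}:${IA_SECRET_KEY}" \
         --header "x-archive-auto-make-bucket: 1" \
         --header "x-archive-meta-date: 1987-06-01" \
         --header "x-archive-meta-year: 1987" \
         --upload-file some_game_1020.zip \
         https://s3.us.archive.org/some_game_1020/some_game_1020.zip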