#archiveteam 2012-09-04,Tue

Time Nickname Message
00:14 🔗 SketchCow alard: It makes me nervous when you say a site is large.
00:21 🔗 illunatic a political song if you like http://blog.greenpirate.org/voters-lament-song-debut-from-grandpa-matt/
00:21 🔗 illunatic nigh-partisan
01:44 🔗 chronomex alard: do you delete projects from the tracker database when they're done?
01:44 🔗 chronomex that makes the long-term graphs disappear :(
01:49 🔗 SketchCow Maybe we need to think about a way to "freeze"
01:54 🔗 Aragan Alright, back.
03:07 🔗 Nintendud Tracker rate limiting is in effect? Seems like some of my warrior threads can't get work.
03:08 🔗 chronomex remember this is probably a single host we're hammering.
03:09 🔗 Nintendud Their forums are hosted on a single host? Aw.
03:09 🔗 Nintendud Well, as long as it is intentional limiting
03:10 🔗 chronomex I don't actually know
03:52 🔗 S[h]O[r]T it's intentional
03:57 🔗 S[h]O[r]T @alard i don't mind some diy, i can always try. what exactly are the non-thread pages? do you mean the forumdisplay and /archive stuff?
04:37 🔗 illunatic http://blog.greenpirate.org/hugo-awards-censored-by-copyright-enforcement-ai/
04:37 🔗 illunatic how copyright will destroy us all^
05:42 🔗 illunatic http://blog.greenpirate.org/lolnews/
06:54 🔗 Coderjoe illunatic: off-topic. use #archiveteam-bs for that, please.
07:13 🔗 alard S[h]O[r]T: Yes, the /index.php, the /forumdisplay.php pages and the announcements. You could use the wget-lua script for that, I think. Comment out the lines that lead from forumdisplay to the threads, then run wget on the list of urls /index.php + /forumdisplay.php?f= with each of the forum/subforum IDs.
07:14 🔗 alard With page-requisites, but without mirror.
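[Editor's sketch] The URL list alard describes (/index.php plus one /forumdisplay.php?f= URL per forum/subforum ID) could be generated roughly as below. This is a minimal illustration, not the project's actual tooling: the function name `listing_urls` and the example IDs are hypothetical, and the real forum IDs would have to be collected from the site.

```python
from urllib.parse import urlencode

# Forum base URL as mentioned in the log.
BASE = "http://boards.cityofheroes.com"


def listing_urls(forum_ids, base=BASE):
    """Return /index.php plus one /forumdisplay.php?f=<id> URL per forum ID."""
    urls = [f"{base}/index.php"]
    for forum_id in forum_ids:
        query = urlencode({"f": forum_id})
        urls.append(f"{base}/forumdisplay.php?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical forum IDs, for illustration only.
    for url in listing_urls([1, 2, 3]):
        print(url)
```

The printed list could be saved to a file and passed to wget with `--input-file` and `--page-requisites`, leaving out `--mirror`, which matches the "with page-requisites, but without mirror" advice.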
07:15 🔗 alard I hadn't seen the archive: http://boards.cityofheroes.com/archive/index.php/ These are just copies of what's on the real forum, right?
07:15 🔗 chronomex check for the purged ones maybe?
07:16 🔗 alard chronomex: I usually remove the finished projects, yes. I download the log file and clear the memory.
07:16 🔗 chronomex hrm
07:16 🔗 chronomex I might have to change my graphing adapter thing then
07:17 🔗 Aragan Oh whoa.
07:17 🔗 Aragan alard, the archive looks like it's from 2004 o_o
07:17 🔗 Aragan I never saw this.
07:17 🔗 Aragan Wait--no.
07:18 🔗 Aragan It's ordered from oldest first, to newest last.
07:18 🔗 alard chronomex: Sorry, I didn't know about that. I thought it just remembered the old values. Still, it's probably a good idea to remove the old data from Redis.
07:18 🔗 Aragan http://boards.cityofheroes.com/archive/index.php/t-296680.html <- This was posted on the forums within the past few hours.
07:18 🔗 chronomex oh it remembers the values in the rrd files, but they don't render unless the script outputs the name, which it gets from the database
07:19 🔗 alard Aragan: It could be a search engine thing, for search engines that don't like query strings.
07:19 🔗 alard Every page has a link back to the 'full version' too.
07:21 🔗 alard I don't think there's anything with IDs under 100,000.
09:33 🔗 godane looks to be another 3000 urls and theblaze.com stories will be backed up
10:45 🔗 godane 1100 urls to go
10:48 🔗 godane i think forum posts on archive.org should be locked after a year or you get spam: http://archive.org/post/339794/merry-christmas-everybody
12:34 🔗 godane i got all stories from theblaze.com
12:35 🔗 godane now i'm going to look at getting all theblaze.com/wp-content/uploads/ files
12:35 🔗 godane which are images, pdfs, maybe zips and html pages
13:06 🔗 godane i'm getting everything in theblaze.com/wp-content/ folder
14:46 🔗 illunatic Coderjoe: sure
15:12 🔗 godane the images from theblaze are very big
15:12 🔗 godane i have 18 warc.gz so far at about 100mb
15:48 🔗 DFJustin Swizzle: whoah lotta games going in now
15:48 🔗 DFJustin should they all have the _1020 suffix?
15:49 🔗 Swizzle DFJustin: Yea - Nemo_bis showed me https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader so I'm burning through the collection. Will need to do a QA pass on a bunch to add descriptions, but most of the content is going up automatically at least
15:50 🔗 Swizzle Yea, I was lazy and "randomized" the id's by adding _1020 to the end of them
15:50 🔗 DFJustin hehe
15:50 🔗 Swizzle I found that if the id matched one already in the database the uploader would go crazy on me
15:51 🔗 Nemo_bis hmm this shouldn't happen
15:51 🔗 DFJustin btw something you may want to enable on the collection, adding a property called show_search_by_year set to "true" gives you a "browse by year" link
15:54 🔗 DFJustin the s3 uploader only sets the date field and not the year though unless you specify both :(
15:54 🔗 Swizzle Awesome! I've gone ahead and added it. Although I only specified the date field. Does that search only the year field?
15:55 🔗 DFJustin yeah
15:56 🔗 Swizzle Ouch. Well at least I can change the csv for the remaining files. I will need to fix up what I did last night
15:57 🔗 DFJustin just doing an edit/save with the web interface sets the year field from the date automatically
15:58 🔗 DFJustin they just haven't hooked that up on the s3 side
15:58 🔗 Swizzle Yea, I've noticed that before. I've never understood why they have both fields so I just chose one when I did my csv file. I'm kicking myself now for just not including both
16:02 🔗 DFJustin you can bulk-fix using metamgr.php, should probably move discussion to #archiveteam-bs though as it's a little off-topic
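[Editor's sketch] For the remaining uploads, one way to specify both fields is to derive a year column from the date column in the CSV before it goes to the uploader. The sketch below assumes columns named `identifier`, `date` and `year` and dates that begin with a four-digit year; the actual column layout expected by the bulk uploader is not shown in the log, so these names are assumptions.

```python
import csv
import io


def add_year_column(csv_text, date_field="date", year_field="year"):
    """Return CSV text with a year column filled from the first 4 digits of the date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if year_field not in fieldnames:
        fieldnames.append(year_field)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        date = (row.get(date_field) or "").strip()
        # Assumption: dates start with a four-digit year, e.g. 1994-05-01 or 1994.
        # Rows that already carry a year, or have an unusable date, are left alone.
        if not row.get(year_field) and len(date) >= 4 and date[:4].isdigit():
            row[year_field] = date[:4]
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "identifier,date\ngame_1020,1994-05-01\n"
    print(add_year_column(sample), end="")
```

Items that were already uploaded with only the date field would still need the edit/save or bulk-fix route mentioned above.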
16:40 🔗 swebb FYI: I'm going to remove the 'textfiles' query from the #archiveteam-twitter channel.
16:57 🔗 alard swebb (or anyone else who manages the various batsignals): There's a new project on the warrior.
16:58 🔗 swebb What's the project?
16:58 🔗 alard http://tracker.archiveteam.org/cityofheroes/
16:58 🔗 alard http://boards.cityofheroes.com/
16:58 🔗 alard It may or may not disappear sooner or later.
17:00 🔗 alard There's a bit of "realignment of company focus" and "celebrating the legacy" going on.
17:00 🔗 DFJustin is it doing --page-requisites grabs to get offsite images
17:00 🔗 alard DFJustin: Certainly.
17:01 🔗 alard Here's the full Wget command line: https://github.com/ArchiveTeam/cityofheroes-grab/blob/master/pipeline.py#L75-93
17:03 🔗 DFJustin :D
17:11 🔗 alard Maybe we should also make a copy of http://www.cityofheroes.com/, and perhaps even the fan sites: http://na.cityofheroes.com/en/community/fan_sites/fan_sites.php
17:17 🔗 swebb Looks like the rsync server may be having some problems. I'm getting stalls when uploading using the warrior.
17:24 🔗 swebb NM. Better now
17:56 🔗 Schbirid does someone have some bash snippet to turn any string (eg a filename) into a archive.org item name compatible string? replacing spaces with underscores etc
17:57 🔗 Famicoman yessir
17:57 🔗 Schbirid gimmeh
17:58 🔗 Famicoman https://gist.github.com/3391205
17:58 🔗 Famicoman note, it also does uppercase to lowercase
17:58 🔗 Famicoman which isn't required
17:58 🔗 Famicoman but you can edit that out if you want
17:58 🔗 Schbirid perfect, thanks
17:59 🔗 Famicoman np
19:04 🔗 berndj on a local freecycle list, "30+ old cds (mostly 90's software), ...", is that something that interests you folk? also "a bag of floppies" (no indication what's on those)
19:08 🔗 Schbirid Famicoman: tr -d '[{}(),\!:?~@#$%^&*()+=;<>|]' <- has () two times, oversight?
19:12 🔗 DFJustin berndj: sure is
19:12 🔗 DFJustin SketchCow is the one usually accumulating that kind of thing but he's probably afk enjoying manchester
19:13 🔗 Famicoman total oversight, and I bet some symbols in there aren't filename worthy
19:17 🔗 Schbirid ok
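[Editor's sketch] The same idea as the bash snippet discussed above, written in Python: replace spaces with underscores, strip the characters from the quoted `tr -d` set (with `()` listed only once), and optionally lowercase. This is not the contents of the gist, and it does not claim to cover archive.org's full identifier rules; the function name `to_item_name` and the underscore-collapsing step are additions for illustration.

```python
import re

# Characters from the tr -d set quoted in the log, with "()" listed once.
STRIP_CHARS = "[]{}(),\\!:?~@#$%^&*+=;<>|"


def to_item_name(name, lowercase=True):
    """Turn an arbitrary string (e.g. a filename) into a more identifier-friendly one."""
    name = name.strip().replace(" ", "_")
    # Delete every character listed in STRIP_CHARS.
    name = name.translate(str.maketrans("", "", STRIP_CHARS))
    # Collapse runs of underscores left behind by removed characters (an addition).
    name = re.sub(r"_+", "_", name)
    # Lowercasing mirrors the gist's behaviour, which Famicoman notes is optional.
    return name.lower() if lowercase else name


if __name__ == "__main__":
    print(to_item_name("My Game (1994)!.zip"))
```

Passing `lowercase=False` keeps the original case, matching the note that lowercasing "isn't required".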
19:29 🔗 Schbirid does IA decide on deriving based on the user-specified mediatype? or does it check what uploaded files actually are?
19:31 🔗 DFJustin afaik the mediatype doesn't influence the derive, just the presentation of the page and which collections it shows up in
19:32 🔗 Schbirid ok
19:37 🔗 Coderjoe I also have accumulated old media, but it is mostly from my own flotsam
20:26 🔗 ersi Schbirid: #internetarchive :)
20:26 🔗 Schbirid seriously?
20:26 🔗 ersi underscor started a support hang about
20:26 🔗 ersi ya
20:26 🔗 ersi I mean, I'm not saying you're off topic
20:26 🔗 Schbirid sigh
20:27 🔗 ersi it's also for people who aren't associated to archiveteam
20:27 🔗 ersi however unlikely that might be
20:27 🔗 Schbirid ok
20:27 🔗 ersi imo it's a great idea ^_^
20:37 🔗 underscor <3