#archiveteam 2012-09-04,Tue

Time	Nickname	Message
00:14 ^🔗	SketchCow	alard: It makes me nervous when you say a site is large.
00:21 ^🔗	illunatic	a political song if you like http://blog.greenpirate.org/voters-lament-song-debut-from-grandpa-matt/
00:21 ^🔗	illunatic	nigh-partisan
01:44 ^🔗	chronomex	alard: do you delete projects from the tracker database when they're done?
01:44 ^🔗	chronomex	that makes the long-term graphs disappear :(
01:49 ^🔗	SketchCow	Maybe we need to think about a way to "freeze"
01:54 ^🔗	Aragan	Alright, back.
03:07 ^🔗	Nintendud	Tracker rate limiting is in effect? Seems like some of my warrior threads can't get work.
03:08 ^🔗	chronomex	remember this is probably a single host we're hammering.
03:09 ^🔗	Nintendud	Their forums are hosted on a single host? Aw.
03:09 ^🔗	Nintendud	Well, as long as it is intentional limiting
03:10 ^🔗	chronomex	I don't actually know
03:52 ^🔗	S[h]O[r]T	its intentional
03:57 ^🔗	S[h]O[r]T	@alard i dont mind some diy, i can always try. what exactly are the non thread pages? do you mean the forumdisplay and /archive stuff?
04:37 ^🔗	illunatic	http://blog.greenpirate.org/hugo-awards-censored-by-copyright-enforcement-ai/
04:37 ^🔗	illunatic	how copyright will destroy us all^
05:42 ^🔗	illunatic	http://blog.greenpirate.org/lolnews/
06:54 ^🔗	Coderjoe	illunatic: off-topic. use #archiveteam-bs for that, please.
07:13 ^🔗	alard	S[h]O[r]T: Yes, the /index.php, the /forumdisplay.php pages and the announcements. You could use the wget-lua script for that, I think. Comment out the lines that lead from forumdisplay to the threads, then run wget on the list of urls /index.php + /forumdisplay.php?f= with each of the forum/subforum IDs.
07:14 ^🔗	alard	With page-requisites, but without mirror.
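
A minimal sketch of the URL list alard describes: /index.php plus one /forumdisplay.php?f= URL per forum or subforum ID. The function name `coh_urls` and the example IDs are hypothetical; the real IDs would have to be collected from the forum itself.

```shell
# Hypothetical helper: print the non-thread URLs, one per line.
# Arguments are forum/subforum IDs (example values only, not real IDs).
coh_urls() {
  base='http://boards.cityofheroes.com'
  printf '%s/index.php\n' "$base"
  for id in "$@"; do
    printf '%s/forumdisplay.php?f=%s\n' "$base" "$id"
  done
}

# Example: write the list to a file that wget can read.
coh_urls 5 12 > urls.txt
```

The list could then be fed to something like `wget --page-requisites --input-file=urls.txt`, leaving out `--mirror` as suggested. The actual grab uses the wget-lua script with further options, so treat this as an outline only.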
07:15 ^🔗	alard	I hadn't seen the archive: http://boards.cityofheroes.com/archive/index.php/ These are just copies of what's on the real forum, right?
07:15 ^🔗	chronomex	check for the purged ones maybe?
07:16 ^🔗	alard	chronomex: I usually remove the finished projects, yes. I download the log file and clear the memory.
07:16 ^🔗	chronomex	hrm
07:16 ^🔗	chronomex	I might have to change my graphing adapter thing then
07:17 ^🔗	Aragan	Oh whoa.
07:17 ^🔗	Aragan	alard, the archive looks like it's from 2004 o_o
07:17 ^🔗	Aragan	I never saw this.
07:17 ^🔗	Aragan	Wait--no.
07:18 ^🔗	Aragan	It's ordered from oldest first, to newest last.
07:18 ^🔗	alard	chronomex: Sorry, I didn't know about that. I thought it just remembered the old values. Still, it's probably a good idea to remove the old data from Redis.
07:18 ^🔗	Aragan	http://boards.cityofheroes.com/archive/index.php/t-296680.html <- This was posted on the forums within the past few hours.
07:18 ^🔗	chronomex	oh it remembers the values in the rrd files, but they don't render unless the script outputs the name, which it gets from the database
07:19 ^🔗	alard	Aragan: It could be a search engine thing, for search engines that don't like query strings.
07:19 ^🔗	alard	Every page has a link back to the 'full version' too.
07:21 ^🔗	alard	I don't think there's anything with IDs under 100,000.
09:33 ^🔗	godane	looks to be another 3000 urls and theblaze.com stories will be backed up
10:45 ^🔗	godane	1100 urls to go
10:48 ^🔗	godane	i think forum post on archive.org should be locked after a year or you get spam: http://archive.org/post/339794/merry-christmas-everybody
12:34 ^🔗	godane	i got all stories from theblaze.com
12:35 ^🔗	godane	now i'm going to look at getting all theblaze.com/wp-content/uploads/ files
12:35 ^🔗	godane	which are images, pdfs, maybe zips and html pages
13:06 ^🔗	godane	i'm getting everything in theblaze.com/wp-content/ folder
14:46 ^🔗	illunatic	Coderjoe: sure
15:12 ^🔗	godane	the images from theblaze are very big
15:12 ^🔗	godane	i have 18 warc.gz so far at about 100mb
15:48 ^🔗	DFJustin	Swizzle: whoah lotta games going in now
15:48 ^🔗	DFJustin	should they all have the _1020 suffix?
15:49 ^🔗	Swizzle	DFJustin: Yea - Nemo_bis showed me https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader so I'm burning through the collection. Will need to do a QA pass on a bunch to add descriptions, but most of the content is going up automatically at least
15:50 ^🔗	Swizzle	Yea, I was lazy and "randomized" the id's by adding _1020 to the end of them
15:50 ^🔗	DFJustin	hehe
15:50 ^🔗	Swizzle	I found that if the id matched one already in the database the uploader would go crazy on me
15:51 ^🔗	Nemo_bis	hmm this shouldn't happen
15:51 ^🔗	DFJustin	btw something you may want to enable on the collection, adding a property called show_search_by_year set to "true" gives you a "browse by year" link
15:54 ^🔗	DFJustin	the s3 uploader only sets the date field and not the year though unless you specify both :(
15:54 ^🔗	Swizzle	Awesome! I've gone ahead and added it. Although I only specified the date field. Does that search only the year field?
15:55 ^🔗	DFJustin	yeah
15:56 ^🔗	Swizzle	Ouch. Well at least I can change the csv for the remaining files. I will need to fix up what I did last night
15:57 ^🔗	DFJustin	just doing an edit/save with the web interface sets the year field from the date automatically
15:58 ^🔗	DFJustin	they just haven't hooked that up on the s3 side
15:58 ^🔗	Swizzle	Yea, I've noticed that before. I've never understood why they have both fields so I just chose one when I did my csv file. I'm kicking myself now for just not including both
16:02 ^🔗	DFJustin	you can bulk-fix using metamgr.php, should probably move discussion to #archiveteam-bs though as it's a little off-topic
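
A rough sketch of the CSV change Swizzle mentions for the remaining files: appending a year column derived from the date column, so that both fields get specified. It assumes a header row with a column literally named `date`, values that start with a four-digit year, and no quoted fields containing commas. The function name `add_year` and the column layout are assumptions, not the uploader's documented format.

```shell
# Hypothetical helper: read CSV on stdin, append a "year" column
# taken from the first four characters of the "date" column.
add_year() {
  awk -F',' -v OFS=',' '
    NR == 1 {
      # Locate the date column by its header name.
      for (i = 1; i <= NF; i++) if ($i == "date") col = i
      if (!col) { print "no date column" > "/dev/stderr"; exit 1 }
      print $0, "year"
      next
    }
    { print $0, substr($col, 1, 4) }
  '
}

# Example with made-up rows:
printf 'identifier,date\nsomegame_1020,1994-05-01\n' | add_year
```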
16:40 ^🔗	swebb	FYI: I'm going to remove the 'textfiles' query from the #archiveteam-twitter channel.
16:57 ^🔗	alard	swebb (or anyone else who manages the various batsignals): There's a new project on the warrior.
16:58 ^🔗	swebb	What's the project?
16:58 ^🔗	alard	http://tracker.archiveteam.org/cityofheroes/
16:58 ^🔗	alard	http://boards.cityofheroes.com/
16:58 ^🔗	alard	It may or may not disappear sooner or later.
17:00 ^🔗	alard	There's a bit of "realignment of company focus" and "celebrating the legacy" going on.
17:00 ^🔗	DFJustin	is it doing --page-requisites grabs to get offsite images
17:00 ^🔗	alard	DFJustin: Certainly.
17:01 ^🔗	alard	Here's the full Wget command line: https://github.com/ArchiveTeam/cityofheroes-grab/blob/master/pipeline.py#L75-93
17:03 ^🔗	DFJustin	:D
17:11 ^🔗	alard	Maybe we should also make a copy of http://www.cityofheroes.com/, and perhaps even the fan sites: http://na.cityofheroes.com/en/community/fan_sites/fan_sites.php
17:17 ^🔗	swebb	Looks like the rsync server may be having some problems. I'm getting stalls when uploading using the warrior.
17:24 ^🔗	swebb	NM. Better now
17:56 ^🔗	Schbirid	does someone have some bash snippet to turn any string (eg a filename) into a archive.org item name compatible string? replacing spaces with underscores etc
17:57 ^🔗	Famicoman	yessir
17:57 ^🔗	Schbirid	gimmeh
17:58 ^🔗	Famicoman	https://gist.github.com/3391205
17:58 ^🔗	Famicoman	note, it also does uppercase to lowercase
17:58 ^🔗	Famicoman	which isn't required
17:58 ^🔗	Famicoman	but you can edit that out if you want
17:58 ^🔗	Schbirid	perfect, thanks
17:59 ^🔗	Famicoman	np
19:04 ^🔗	berndj	on a local freecycle list, "30+ old cds (mostly 90's software), ...", is that something that interests you folk? also "a bag of floppies" (no indication what's on those)
19:08 ^🔗	Schbirid	Famicoman: tr -d '[{}(),\!:?~@#$%^&*()+=;<>\|]' <- has () two times, oversight?
19:12 ^🔗	DFJustin	berndj: sure is
19:12 ^🔗	DFJustin	SketchCow is the one usually accumulating that kind of thing but he's probably afk enjoying manchester
19:13 ^🔗	Famicoman	total oversight, and I bet some symbols in there aren't filename worthy
19:17 ^🔗	Schbirid	ok
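
A minimal sketch of the kind of snippet Schbirid asked for. This is not the script from the gist. Instead of listing characters to delete with `tr -d`, it keeps only an allow-list, on the assumption that item names are limited to ASCII letters, digits, `.`, `_` and `-`. The function name `ia_name` is hypothetical, and the lowercase step is optional, as Famicoman notes.

```shell
# Hypothetical helper: turn an arbitrary string into an item-name-safe string.
ia_name() {
  printf '%s' "$1" |
    LC_ALL=C tr ' ' '_' |           # spaces become underscores
    LC_ALL=C tr 'A-Z' 'a-z' |       # optional: uppercase to lowercase
    LC_ALL=C tr -cd 'a-z0-9._-'     # drop everything not on the allow-list
}

ia_name 'My File (v2).ZIP'   # prints my_file_v2.zip
```

An allow-list avoids the problem of a repeated or forgotten symbol in the delete set, at the cost of silently dropping non-ASCII characters.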
19:29 ^🔗	Schbirid	does IA decide on deriving based on the user-specified mediatype? or does it check what uploaded files actually are?
19:31 ^🔗	DFJustin	afaik the mediatype doesn't influence the derive, just the presentation of the page and which collections it shows up in
19:32 ^🔗	Schbirid	ok
19:37 ^🔗	Coderjoe	I also have accumulated old media, but it is mostly from my own flotsam
20:26 ^🔗	ersi	Schbirid: #internetarchive :)
20:26 ^🔗	Schbirid	seriously?
20:26 ^🔗	ersi	underscor started a support hang about
20:26 ^🔗	ersi	ya
20:26 ^🔗	ersi	I mean, I'm not saying you're off topic
20:26 ^🔗	Schbirid	sigh
20:27 ^🔗	ersi	it's also for people who aren't associated to archiveteam
20:27 ^🔗	ersi	however unlikely that might be
20:27 ^🔗	Schbirid	ok
20:27 ^🔗	ersi	imo it's a great idea ^_^
20:37 ^🔗	underscor	<3
