Time |
Nickname |
Message |
00:14
🔗
|
SketchCow |
alard: It makes me nervous when you say a site is large. |
00:21
🔗
|
illunatic |
a political song if you like http://blog.greenpirate.org/voters-lament-song-debut-from-grandpa-matt/ |
00:21
🔗
|
illunatic |
nigh-partisan |
01:44
🔗
|
chronomex |
alard: do you delete projects from the tracker database when they're done? |
01:44
🔗
|
chronomex |
that makes the long-term graphs disappear :( |
01:49
🔗
|
SketchCow |
Maybe we need to think about a way to "freeze" |
01:54
🔗
|
Aragan |
Alright, back. |
03:07
🔗
|
Nintendud |
Tracker rate limiting is in effect? Seems like some of my warrior threads can't get work. |
03:08
🔗
|
chronomex |
remember this is probably a single host we're hammering. |
03:09
🔗
|
Nintendud |
Their forums are hosted on a single host? Aw. |
03:09
🔗
|
Nintendud |
Well, as long as it is intentional limiting |
03:10
🔗
|
chronomex |
I don't actually know |
03:52
🔗
|
S[h]O[r]T |
it's intentional |
03:57
🔗
|
S[h]O[r]T |
@alard i don't mind some diy, i can always try. what exactly are the non-thread pages? do you mean the forumdisplay and /archive stuff? |
04:37
🔗
|
illunatic |
http://blog.greenpirate.org/hugo-awards-censored-by-copyright-enforcement-ai/ |
04:37
🔗
|
illunatic |
how copyright will destroy us all^ |
05:42
🔗
|
illunatic |
http://blog.greenpirate.org/lolnews/ |
06:54
🔗
|
Coderjoe |
illunatic: off-topic. use #archiveteam-bs for that, please. |
07:13
🔗
|
alard |
S[h]O[r]T: Yes, the /index.php, the /forumdisplay.php pages and the announcements. You could use the wget-lua script for that, I think. Comment out the lines that lead from forumdisplay to the threads, then run wget on the list of urls /index.php + /forumdisplay.php?f= with each of the forum/subforum IDs. |
07:14
🔗
|
alard |
With page-requisites, but without mirror. |
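A minimal sketch of the URL list alard describes above: `/index.php` plus one `/forumdisplay.php?f=` URL per forum or subforum ID. The function name and the sample IDs are hypothetical; the real IDs would have to be collected from the board itself, and the project's wget-lua script is not reproduced here.

```python
BASE = "http://boards.cityofheroes.com"


def non_thread_urls(forum_ids, base=BASE):
    """Return /index.php plus one /forumdisplay.php?f=<id> URL per forum ID."""
    urls = [f"{base}/index.php"]
    urls += [f"{base}/forumdisplay.php?f={fid}" for fid in forum_ids]
    return urls


if __name__ == "__main__":
    # Hypothetical forum IDs, for illustration only.
    for url in non_thread_urls([1, 2, 3]):
        print(url)
```

The printed list could be saved to a file and handed to wget through its `--input-file` option together with `--page-requisites`, leaving out `--mirror` as suggested.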
07:15
🔗
|
alard |
I hadn't seen the archive: http://boards.cityofheroes.com/archive/index.php/ These are just copies of what's on the real forum, right? |
07:15
🔗
|
chronomex |
check for the purged ones maybe? |
07:16
🔗
|
alard |
chronomex: I usually remove the finished projects, yes. I download the log file and clear the memory. |
07:16
🔗
|
chronomex |
hrm |
07:16
🔗
|
chronomex |
I might have to change my graphing adapter thing then |
07:17
🔗
|
Aragan |
Oh whoa. |
07:17
🔗
|
Aragan |
alard, the archive looks like it's from 2004 o_o |
07:17
🔗
|
Aragan |
I never saw this. |
07:17
🔗
|
Aragan |
Wait--no. |
07:18
🔗
|
Aragan |
It's ordered from oldest first, to newest last. |
07:18
🔗
|
alard |
chronomex: Sorry, I didn't know about that. I thought it just remembered the old values. Still, it's probably a good idea to remove the old data from Redis. |
07:18
🔗
|
Aragan |
http://boards.cityofheroes.com/archive/index.php/t-296680.html <- This was posted on the forums within the past few hours. |
07:18
🔗
|
chronomex |
oh it remembers the values in the rrd files, but they don't render unless the script outputs the name, which it gets from the database |
07:19
🔗
|
alard |
Aragan: It could be a search engine thing, for search engines that don't like query strings. |
07:19
🔗
|
alard |
Every page has a link back to the 'full version' too. |
07:21
🔗
|
alard |
I don't think there's anything with IDs under 100,000. |
09:33
🔗
|
godane |
looks to be another 3000 urls and theblaze.com stories will be backed up |
10:45
🔗
|
godane |
1100 urls to go |
10:48
🔗
|
godane |
i think forum posts on archive.org should be locked after a year or you get spam: http://archive.org/post/339794/merry-christmas-everybody |
12:34
🔗
|
godane |
i got all stories from theblaze.com |
12:35
🔗
|
godane |
now i'm going to look at getting all theblaze.com/wp-content/uploads/ files |
12:35
🔗
|
godane |
which are images, pdfs, maybe zips and html pages |
13:06
🔗
|
godane |
i'm getting everything in theblaze.com/wp-content/ folder |
14:46
🔗
|
illunatic |
Coderjoe: sure |
15:12
🔗
|
godane |
the images from theblaze are very big |
15:12
🔗
|
godane |
i have 18 warc.gz so far at about 100mb |
15:48
🔗
|
DFJustin |
Swizzle: whoah lotta games going in now |
15:48
🔗
|
DFJustin |
should they all have the _1020 suffix? |
15:49
🔗
|
Swizzle |
DFJustin: Yea - Nemo_bis showed me https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader so I'm burning through the collection. Will need to do a QA pass on a bunch to add descriptions, but most of the content is going up automatically at least |
15:50
🔗
|
Swizzle |
Yea, I was lazy and "randomized" the id's by adding _1020 to the end of them |
15:50
🔗
|
DFJustin |
hehe |
15:50
🔗
|
Swizzle |
I found that if the id matched one already in the database the uploader would go crazy on me |
15:51
🔗
|
Nemo_bis |
hmm this shouldn't happen |
15:51
🔗
|
DFJustin |
btw something you may want to enable on the collection, adding a property called show_search_by_year set to "true" gives you a "browse by year" link |
15:54
🔗
|
DFJustin |
the s3 uploader only sets the date field and not the year though unless you specify both :( |
15:54
🔗
|
Swizzle |
Awesome! I've gone ahead and added it. Although I only specified the date field. Does that search only the year field? |
15:55
🔗
|
DFJustin |
yeah |
15:56
🔗
|
Swizzle |
Ouch. Well at least I can change the csv for the remaining files. I will need to fix up what I did last night |
15:57
🔗
|
DFJustin |
just doing an edit/save with the web interface sets the year field from the date automatically |
15:58
🔗
|
DFJustin |
they just haven't hooked that up on the s3 side |
15:58
🔗
|
Swizzle |
Yea, I've noticed that before. I've never understood why they have both fields so I just chose one when I did my csv file. I'm kicking myself now for just not including both |
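A rough sketch of the CSV fix being discussed: fill a `year` column from an existing `date` column so both fields are specified at upload time. It assumes the dates begin with a four-digit year (for example `1994-05-01`). The field names `date` and `year` come from the conversation, but the exact CSV layout the bulk uploader expects is not shown here, so the `identifier` column in the usage example is hypothetical.

```python
import csv
import io


def add_year_column(csv_text, date_field="date", year_field="year"):
    """Return csv_text with year_field filled from the first 4 chars of date_field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    if year_field not in fieldnames:
        fieldnames.append(year_field)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if not row.get(year_field):
            date = (row.get(date_field) or "").strip()
            # Assumption: the date starts with a four-digit year.
            row[year_field] = date[:4] if date[:4].isdigit() else ""
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "identifier,date\ngame_1020,1994-05-01\n"
    print(add_year_column(sample), end="")
```

Rows that already carry a non-empty `year` are left alone, so the script can be re-run on a partly fixed file.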
16:02
🔗
|
DFJustin |
you can bulk-fix using metamgr.php, should probably move discussion to #archiveteam-bs though as it's a little off-topic |
16:40
🔗
|
swebb |
FYI: I'm going to remove the 'textfiles' query from the #archiveteam-twitter channel. |
16:57
🔗
|
alard |
swebb (or anyone else who manages the various batsignals): There's a new project on the warrior. |
16:58
🔗
|
swebb |
What's the project? |
16:58
🔗
|
alard |
http://tracker.archiveteam.org/cityofheroes/ |
16:58
🔗
|
alard |
http://boards.cityofheroes.com/ |
16:58
🔗
|
alard |
It may or may not disappear sooner or later. |
17:00
🔗
|
alard |
There's a bit of "realignment of company focus" and "celebrating the legacy" going on. |
17:00
🔗
|
DFJustin |
is it doing --page-requisites grabs to get offsite images |
17:00
🔗
|
alard |
DFJustin: Certainly. |
17:01
🔗
|
alard |
Here's the full Wget command line: https://github.com/ArchiveTeam/cityofheroes-grab/blob/master/pipeline.py#L75-93 |
17:03
🔗
|
DFJustin |
:D |
17:11
🔗
|
alard |
Maybe we should also make a copy of http://www.cityofheroes.com/, and perhaps even the fan sites: http://na.cityofheroes.com/en/community/fan_sites/fan_sites.php |
17:17
🔗
|
swebb |
Looks like the rsync server may be having some problems. I'm getting stalls when uploading using the warrior. |
17:24
🔗
|
swebb |
NM. Better now |
17:56
🔗
|
Schbirid |
does someone have some bash snippet to turn any string (eg a filename) into an archive.org item-name-compatible string? replacing spaces with underscores etc |
17:57
🔗
|
Famicoman |
yessir |
17:57
🔗
|
Schbirid |
gimmeh |
17:58
🔗
|
Famicoman |
https://gist.github.com/3391205 |
17:58
🔗
|
Famicoman |
note, it also does uppercase to lowercase |
17:58
🔗
|
Famicoman |
which isn't required |
17:58
🔗
|
Famicoman |
but you can edit that out if you want |
17:58
🔗
|
Schbirid |
perfect, thanks |
17:59
🔗
|
Famicoman |
np |
19:04
🔗
|
berndj |
on a local freecycle list, "30+ old cds (mostly 90's software), ...", is that something that interests you folk? also "a bag of floppies" (no indication what's on those) |
19:08
🔗
|
Schbirid |
Famicoman: tr -d '[{}(),\!:?~@#$%^&*()+=;<>|]' <- has () two times, oversight? |
19:12
🔗
|
DFJustin |
berndj: sure is |
19:12
🔗
|
DFJustin |
SketchCow is the one usually accumulating that kind of thing but he's probably afk enjoying manchester |
19:13
🔗
|
Famicoman |
total oversight, and I bet some symbols in there aren't filename worthy |
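A hedged sketch of the same idea in Python rather than bash. It is not the linked gist: instead of deleting a fixed list of symbols as the `tr -d` call does, it keeps only an allowlist of characters, which avoids having to enumerate every bad symbol. The allowlist (ASCII letters, digits, underscore, dash, period) is an assumption about what archive.org accepts in item names, and the function name is hypothetical.

```python
import re


def to_item_name(text, lowercase=True):
    """Turn an arbitrary string into a conservative item-name candidate."""
    # Replace spaces with underscores, as asked for in the chat.
    name = text.strip().replace(" ", "_")
    # Assumption: only letters, digits, underscore, period and dash are safe.
    name = re.sub(r"[^A-Za-z0-9_.-]", "", name)
    # Lowercasing is optional, as noted for the gist.
    return name.lower() if lowercase else name


if __name__ == "__main__":
    print(to_item_name("My Game (v1.0)!.zip"))  # my_game_v1.0.zip
```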
19:17
🔗
|
Schbirid |
ok |
19:29
🔗
|
Schbirid |
does IA decide on deriving based on the user-specified mediatype? or does it check what uploaded files actually are? |
19:31
🔗
|
DFJustin |
afaik the mediatype doesn't influence the derive, just the presentation of the page and which collections it shows up in |
19:32
🔗
|
Schbirid |
ok |
19:37
🔗
|
Coderjoe |
I also have accumulated old media, but it is mostly from my own flotsam |
20:26
🔗
|
ersi |
Schbirid: #internetarchive :) |
20:26
🔗
|
Schbirid |
seriously? |
20:26
🔗
|
ersi |
underscor started a support hang about |
20:26
🔗
|
ersi |
ya |
20:26
🔗
|
ersi |
I mean, I'm not saying you're off topic |
20:26
🔗
|
Schbirid |
sigh |
20:27
🔗
|
ersi |
it's also for people who aren't associated with archiveteam |
20:27
🔗
|
ersi |
however unlikely that might be |
20:27
🔗
|
Schbirid |
ok |
20:27
🔗
|
ersi |
imo it's a great idea ^_^ |
20:37
🔗
|
underscor |
<3 |