Time |
Nickname |
Message |
00:25
🔗
|
DFJustin |
https://archive.org/details/archive.pdp11.org.ru-20130504 |
00:44
🔗
|
ivan` |
it looks like Feed API hits URLs outside /reader/ and is a lot more annoying to use |
02:54
🔗
|
SketchCow |
Slightly redid the page, nothing earth-shattering. |
05:27
🔗
|
SketchCow |
Time to set up some floppy scannin'! |
05:28
🔗
|
SketchCow |
I'm going to start with Compute! floppies, work from there. |
07:52
🔗
|
Nemo_bis |
aww floppies |
07:53
🔗
|
Nemo_bis |
number of (media) files on Wikimedia Foundation servers: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa&trend=1 |
07:53
🔗
|
Nemo_bis |
mostly thumbnails |
13:21
🔗
|
omf_ |
I uploaded the first block of 10,000 screenshots that are finished - http://archive.org/details/posterous_screens_01 |
13:21
🔗
|
omf_ |
The item contains a tar of pngs, a log file and the list of urls used |
13:22
🔗
|
godane |
i want your script so i can do that |
13:22
🔗
|
omf_ |
What other information should I include? |
14:41
🔗
|
omf_ |
SketchCow, when you have a minute could you flip this from software to texts Mediatype. I goofed the first one - http://archive.org/details/posterous_screens_01 |
15:00
🔗
|
SketchCow |
Done |
15:05
🔗
|
omf_ |
SketchCow, you flipped the collection to Ebook and text, but the mediatype is still software |
15:09
🔗
|
SketchCow |
So I did, so I did. Fixed. |
15:12
🔗
|
yan |
heads up, shouting coming up! |
15:12
🔗
|
yan |
WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD |
15:12
🔗
|
SketchCow |
HUZZAH GOOD SIR, THE MAGIC WORD OF SECRET IS "yahoosucks" |
15:19
🔗
|
godane |
SketchCow: i uploaded 3 of my scanned magazines |
15:19
🔗
|
godane |
https://archive.org/details/pc-novice-1995-04 |
15:19
🔗
|
godane |
https://archive.org/details/pc-novice-1995-05 |
15:19
🔗
|
godane |
https://archive.org/details/pc-novice-1995-06 |
15:23
🔗
|
yan |
thank ye, kind fere, henceforth I am proud to consider myself a proby among your fine band of warriors! |
15:38
🔗
|
SketchCow |
Why would you make this posterous screens collection text versus software? |
15:52
🔗
|
audy |
what do archive now? |
15:53
🔗
|
godane |
fuck me |
15:54
🔗
|
godane |
looks like spark cbc links are not working for older episodes now |
16:05
🔗
|
godane |
good news is it looks like the wayback machine took a good archive of it last year |
16:23
🔗
|
godane |
so we may get lucky and have a very nearly complete collection of cbc spark soon |
16:49
🔗
|
MrArgent |
whoo, 600mb till i've got the Textfiles backup stored locally! |
18:53
🔗
|
berndj |
is there a usenet archive that's free from the shackles of dejanews' new owners? |
19:59
🔗
|
Tux |
berndj: olduse.net? |
20:02
🔗
|
berndj |
ah. as long as the usenet legacy isn't vulnerable to getting geocitied |
21:03
🔗
|
omf_ |
SketchCow, I should be setting this stuff to collection "ourmedia", mediatype "image" for the posterous screens. That seems a better fit |
21:22
🔗
|
omf_ |
Then again the screenshot grabs from last year are all mediatype: web |
21:23
🔗
|
omf_ |
If anyone else has an opinion or view on this, I am open to suggestions |
21:23
🔗
|
omf_ |
http://archive.org/details/archiveteam-fortunecity-screenshots for example |
21:24
🔗
|
Smiley |
sounds good to me |
21:27
🔗
|
omf_ |
Smiley, should I be setting them to mediatype image or web? |
21:27
🔗
|
omf_ |
It is a creative commons collection of images that represent a crawl of the web |
21:28
🔗
|
omf_ |
Well the web aspect is there via the webcrawl tag |
21:28
🔗
|
omf_ |
Take note everyone. Metadata is harder than getting the data :) |
21:32
🔗
|
Smiley |
I'd say image. |
21:32
🔗
|
Smiley |
are the metatags used to work out how to display said data? |
21:32
🔗
|
Smiley |
if so, it needs to be image. |
21:33
🔗
|
omf_ |
IA just uses mime types and magic files |
21:33
🔗
|
omf_ |
it is the fastest and most common method |
21:34
🔗
|
omf_ |
can you really trust user data? |
21:34
🔗
|
DFJustin |
the mediatype does determine the layout of the item page |
21:34
🔗
|
DFJustin |
e.g. "text" doesn't have links to the raw files unless you click through to the HTTP directory view |
21:35
🔗
|
omf_ |
I was thinking of the data work on the files not the display |
21:35
🔗
|
omf_ |
what DFJustin said ^^ |
21:36
🔗
|
DFJustin |
software/image/web/data all seem to have the same or basically the same layout for now, but that could change down the road |
21:37
🔗
|
DFJustin |
for screenshots I'd probably go image because web is more designed for warcs for the wayback machine, and in the future they may add thumbnail browsing or the like |
22:06
🔗
|
SketchCow |
Just got heads |
22:06
🔗
|
SketchCow |
up |
22:06
🔗
|
SketchCow |
Huge internal fight at pouet.net |
22:06
🔗
|
SketchCow |
We need to grab this thing |
22:07
🔗
|
Smiley |
Awww crap, huge forum |
22:07
🔗
|
omf_ |
The url format is pretty easy - http://pouet.net/topic.php?which=9389&page=1&x=25&y=11 |
22:08
🔗
|
omf_ |
which is a thread |
22:08
🔗
|
omf_ |
page is pagination |
22:08
🔗
|
omf_ |
and the x & y can be left off |
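The URL scheme omf_ describes above could be sketched roughly as follows. This is only an illustration: `topic_url` and `topic_urls` are hypothetical helper names, and the number of pages per thread is assumed to be known in advance, which the log does not say.

```python
from urllib.parse import urlencode

# Base of the thread URL quoted in the log.
BASE = "http://pouet.net/topic.php"


def topic_url(thread_id: int, page: int = 1) -> str:
    """Build a thread URL: 'which' selects the thread, 'page' the page.

    The x and y parameters from the quoted URL are left off, as suggested.
    """
    return f"{BASE}?{urlencode({'which': thread_id, 'page': page})}"


def topic_urls(thread_id: int, last_page: int) -> list[str]:
    """List the URLs for pages 1..last_page of one thread (last_page is assumed known)."""
    return [topic_url(thread_id, page) for page in range(1, last_page + 1)]


if __name__ == "__main__":
    for url in topic_urls(9389, 3):
        print(url)
```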
22:09
🔗
|
balrog |
ohshit |
22:09
🔗
|
balrog |
pouet.net is important |
22:09
🔗
|
Smiley |
Doing "normal" wget grab to warc |
22:09
🔗
|
omf_ |
which is also everything else |
22:10
🔗
|
Smiley |
well the server is nice and fast at least. |
22:11
🔗
|
omf_ |
not for long ;) |
22:11
🔗
|
Smiley |
;) |
22:11
🔗
|
omf_ |
This should be pretty easy to run on the warrior |
22:12
🔗
|
Smiley |
get to it then guys |
22:12
🔗
|
Smiley |
:D |
22:12
🔗
|
Smiley |
Not like I know how to setup warrior tasks yet and have far too much to do :( |
22:12
🔗
|
omf_ |
Smiley, For once you're not actually bored?!? |
22:12
🔗
|
omf_ |
there is also this url pattern |
22:13
🔗
|
omf_ |
http://pouet.net/download.php?which=55993 |
22:14
🔗
|
Smiley |
omf_: nope, far too much to do! :D |
22:18
🔗
|
omf_ |
wow so I just read up on pouet and what is going on |
22:19
🔗
|
omf_ |
A user wrote them a new code base and is now using it as leverage to make a land grab |
22:19
🔗
|
omf_ |
and the users are freaking out that all the data is going to get closed and erased |
22:21
🔗
|
balrog |
omf_: where is this? |
22:21
🔗
|
chronomex |
as well they should |
22:21
🔗
|
balrog |
if they can't figure out how to migrate the data, they should freeze and archive the old site as read-only. |
22:23
🔗
|
omf_ |
balrog, it has nothing to do with data migration and more to do with who is going to "own" the data in the future |
22:24
🔗
|
Smiley |
like thingiverse? |
22:24
🔗
|
Smiley |
but not quite. |
22:24
🔗
|
omf_ |
It is 10 pages of comments so far |
22:26
🔗
|
omf_ |
Also pouet made the mistake of not open sourcing the 2.0 from the beginning of the project and this multiyear closed development project is full of ego |
22:26
🔗
|
balrog |
ughhhh |
22:27
🔗
|
omf_ |
They let a coder run wild on his own for years, what did they expect to happen |
22:31
🔗
|
omf_ |
downloading the files is going to be tricky |
22:31
🔗
|
balrog |
omf_: I've heard that FurAffinity has had similar internal political issues for what, the past 2 years? Not that I have an account there, as I don't |
22:36
🔗
|
philpem |
I do. |
22:36
🔗
|
philpem |
The URLs are straightforward incrementing ID numbers. |
22:37
🔗
|
philpem |
Journals and submissions would get you >80% of the "interesting" data. Userpages (linked from submissions) would get you the rest. |
22:37
🔗
|
philpem |
Parse the page and look for usernames in the comments, look who submitted the thing etc. |
22:38
🔗
|
philpem |
The only catch is that they have options for restricting viewing to logged-in users; also anything rated NSFW requires login (and account set to "allow adult content") to see it. |
22:38
🔗
|
balrog |
I do know that they deliberately block all robots including googlebot |
22:39
🔗
|
philpem |
Well yeah, robots.txt |
22:39
🔗
|
balrog |
well yeah that's what I meant |
22:39
🔗
|
philpem |
They used to allow them until Dragoneer had a hissy fit about it |
22:39
🔗
|
omf_ |
Robots.txt does not mean shit, when I think blocking I think of the shit google and yahoo do |
22:40
🔗
|
philpem |
Deviantart, Inkbunny, SoFurry, Weasyl and Nabyn all allow bots in. Better to let them index the site, then you can find stuff with internet searches. (but we're straying off topic) |
22:40
🔗
|
HeaD_ |
grepped over http://pouet.net/groups.php?pattern=[a-z] |
22:41
🔗
|
HeaD_ |
got 61243 prod-ids |
22:41
🔗
|
HeaD_ |
where can i paste it? |
22:41
🔗
|
omf_ |
paste.archivingyoursh.it or pad.archivingyoursh.it |
22:45
🔗
|
HeaD_ |
http://paste.archivingyoursh.it/wadarihewe.md |
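A rough sketch of the kind of ID harvesting HeaD_ describes, not the script actually used. It assumes the listing pages link to productions as `prod.php?which=<id>`, which the log does not confirm, and `extract_prod_ids` is a hypothetical helper name.

```python
import re

# Assumption: production links look like prod.php?which=<number>.
PROD_LINK = re.compile(r"prod\.php\?which=(\d+)")


def extract_prod_ids(html: str) -> list[int]:
    """Return the unique production IDs found in a page, in first-seen order."""
    seen = set()
    ids = []
    for match in PROD_LINK.finditer(html):
        prod_id = int(match.group(1))
        if prod_id not in seen:
            seen.add(prod_id)
            ids.append(prod_id)
    return ids


if __name__ == "__main__":
    sample = '<a href="prod.php?which=55993">x</a> <a href="prod.php?which=55993">x</a>'
    print(extract_prod_ids(sample))
```

Deduplicating matters here because the same production can be linked from more than one listing page.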
22:47
🔗
|
Smiley |
for x,y in ./list_of_ids; do wget ....$x..$y; done |
22:47
🔗
|
Smiley |
except I'm unsure the exact wget we need. |
22:48
🔗
|
Smiley |
Oh wait that's line numbers ¬_¬ |
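Smiley's loop could be sketched along these lines, under loud assumptions: each line of the pasted list is taken to hold an optional line number followed by the ID, the download.php pattern omf_ quoted earlier is used as the target, and the exact wget invocation stays open, as it does in the log. `parse_ids` and `download_url` are hypothetical names.

```python
def parse_ids(lines):
    """Pull one numeric ID per line, ignoring a leading line-number column.

    Assumption: the ID is the last whitespace-separated field on each line.
    A line holding only a line number would be misread as an ID.
    """
    ids = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        last = fields[-1]
        if last.isdigit():
            ids.append(int(last))
    return ids


def download_url(prod_id):
    """Build the download URL pattern quoted earlier in the log."""
    return f"http://pouet.net/download.php?which={prod_id}"


if __name__ == "__main__":
    # Print one URL per line; the output could be saved to a file and fed
    # to a fetcher, for example through wget's --input-file option.
    for prod_id in parse_ids(["1 55993", "2 9389"]):
        print(download_url(prod_id))
```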
22:58
🔗
|
HeaD_ |
i found 11139 groups: http://paste.archivingyoursh.it/nexamejoyi.apache |
23:10
🔗
|
SketchCow |
http://archive.org/details/DOS.Memories.Project.1980-2003 |
23:13
🔗
|
SketchCow |
http://ia601705.us.archive.org/zipview.php?zip=/7/items/DOS.Memories.Project.1980-2003/DOS.Memories.Project.1980-2003.zip |
23:16
🔗
|
DFJustin |
https://archive.org/details/oldies-but-goldies-1740-games https://archive.org/details/Nextys_Archive https://archive.org/details/PC98_Games_1813 |
23:16
🔗
|
DFJustin |
zipview still doesn't like japanese filenames though :( |
23:18
🔗
|
DFJustin |
more to come too, I have a 7.75gb home of the underdogs set and a bunch more .jp stuff |