[00:25] https://archive.org/details/archive.pdp11.org.ru-20130504
[00:44] it looks like the Feed API hits URLs outside /reader/ and is a lot more annoying to use
[02:54] Slightly redid the page, nothing earth-shattering.
[05:27] Time to set up some floppy scannin'!
[05:28] I'm going to start with Compute! floppies and work from there.
[07:52] aww floppies
[07:53] number of (media) files on Wikimedia Foundation servers: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&m=swift_object_count&h=Swift+pmtpa+prod&c=Swift+pmtpa&trend=1
[07:53] mostly thumbnails
[13:21] I uploaded the first block of 10,000 screenshots that are finished - http://archive.org/details/posterous_screens_01
[13:21] The item contains a tar of PNGs, a log file and the list of URLs used
[13:22] i want your script so i can do that
[13:22] What other information should I include?
[14:41] SketchCow, when you have a minute could you flip this from software to texts mediatype? I goofed the first one - http://archive.org/details/posterous_screens_01
[15:00] Done
[15:05] SketchCow, you flipped the collection to Ebook and text, but the mediatype is still software
[15:09] So I did, so I did. Fixed.
[15:12] heads up, shouting coming up!
[15:12] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[15:12] HUZZAH GOOD SIR, THE MAGIC WORD OF SECRET IS "yahoosucks"
[15:19] SketchCow: i uploaded 3 of my scanned magazines
[15:19] https://archive.org/details/pc-novice-1995-04
[15:19] https://archive.org/details/pc-novice-1995-05
[15:19] https://archive.org/details/pc-novice-1995-06
[15:23] thank ye, kind fere, henceforth I am proud to consider myself a proby among your fine band of warriors!
[15:38] Why would you make this posterous screens collection text versus software?
[15:52] what do we archive now?
[15:53] fuck me
[15:54] looks like CBC Spark links are not working for older episodes now
[16:05] good news is it looks like the Wayback Machine took a good archive of it last year
[16:23] so we may get lucky and have a very nearly complete collection of CBC Spark soon
[16:49] whoo, 600MB till i've got the Textfiles backup stored locally!
[18:53] is there a usenet archive that's free from the shackles of dejanews' new owners?
[19:59] berndj: olduse.net?
[20:02] ah. as long as the usenet legacy isn't vulnerable to getting geocitied
[21:03] SketchCow, I should be setting this stuff to collection "ourmedia", mediatype "image" for the posterous screens. That seems a better fit
[21:22] Then again the screenshot grabs from last year are all mediatype: web
[21:23] If anyone else has an opinion or view on this, I am open to suggestions
[21:23] http://archive.org/details/archiveteam-fortunecity-screenshots for example
[21:24] sounds good to me
[21:27] Smiley, should I be setting them to mediatype image or web?
[21:27] It is a creative commons collection of images that represent a crawl of the web
[21:28] Well the web aspect is there via the webcrawl tag
[21:28] Take note everyone. Metadata is harder than getting the data :)
[21:32] I'd say image.
[21:32] are the metatags used to work out how to display said data?
[21:32] if so, it needs to be image.
[21:33] IA just MIME-types and magics the files
[21:33] it is the fastest and most common method
[21:34] can you really trust user data?
[21:34] the mediatype does determine the layout of the item page
[21:34] e.g. "text" doesn't have links to the raw files unless you click through to the HTTP directory view
[21:35] I was thinking of the data work on the files, not the display
[21:35] what DFJustin said ^^
[21:36] software/image/web/data all seem to have the same or basically the same layout for now, but that could change down the road
[21:37] for screenshots I'd probably go image because web is more designed for WARCs for the Wayback Machine, and in the future they may add thumbnail browsing or the like
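A minimal sketch of the mediatype flip discussed above, assuming the "ia" command-line client from the internetarchive package; the log doesn't say what tool was actually used, and changing an item's collection normally needs admin rights:

    # one-time credential setup: ia configure
    # flip the item's mediatype from software to image
    ia metadata posterous_screens_01 --modify="mediatype:image"
    # and file it under the ourmedia collection, per the 21:03 suggestion
    ia metadata posterous_screens_01 --modify="collection:ourmedia"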
"text" doesn't have links to the raw files unless you click through to the HTTP directory view [21:35] I was thinking of the data work on the files not the display [21:35] what DFJustin said ^^ [21:36] software/image/web/data all seem to have the same or basically the same layout for now, but that could change down the road [21:37] for screenshots I'd probably go image because web is more designed for warcs for the wayback machine, and in the future they may add thumbnail browsing or the like [22:06] Just got heads [22:06] up [22:06] Huge internal fight at pouet.net [22:06] We need to grab this thing [22:07] Awww crap, huge forum [22:07] The url format is pretty easy - http://pouet.net/topic.php?which=9389&page=1&x=25&y=11 [22:08] which is a thread [22:08] page is pagination [22:08] and the x & y can be left off [22:09] ohshit [22:09] pouet.net is important [22:09] Doing "normal" wget grab to warc [22:09] which is also everything else [22:10] well the server is nice and fast at least. [22:11] not for long ;) [22:11] ;) [22:11] This should be pretty easy to run on the warrior [22:12] get to it then guys [22:12] :D [22:12] Not like I know how to setup warrior tasks yet and have far too much to do :( [22:12] Smiley, For once your not actually bored?!? [22:12] there is also this url pattern [22:13] http://pouet.net/download.php?which=55993 [22:14] omf_: nope, far too mcuh to do! :D [22:18] wow so I just read up on pouet and what is going on [22:19] A user wrote them a new code base and is now using it as leverage to make a land grab [22:19] and the users are freaking out that all the data is going to get closed and erased [22:21] omf_: where is this? [22:21] as well they should [22:21] if they can't figure out how to migrate the data, they should freeze and archive the old site as read-only. [22:23] balrog, it has nothing to do with data migration and more to do with who is going to "own" the data in the future [22:24] like thingiverse? [22:24] but not quite. [22:24] It is 10 pages of comments so far [22:26] Also pouet made the mistake of not opening source the 2.0 from the beginning of the project and this multiyear closed development project is full of ego [22:26] ughhhh [22:27] They let a coder run wild on his own for years, what did they expect to happen [22:31] downloading the files is going to be tricky [22:31] omf_: I've heard that FurAffinity has had similar internal political issues for what, the past 2 years? Not that I have an account there, as I don't [22:36] I do. [22:36] The URLs are straightforward incrementing ID numbers. [22:37] Journals and submissions would get you >80% of the "interesting" data. Userpages (linked from submissions) would get you the rest. [22:37] Parse the page and look for usernames in the comments, look who submitted the thing etc. [22:38] The only catch is that they have options for restricting viewing to logged-in users; also anything rated NSFW requires login (and account set to "allow adult content") to see it. [22:38] I do know that they deliberately block all robots including googlebot [22:39] Well yeah, robots.txt [22:39] well yeah that's what I meant [22:39] They used to allow them until Dragoneer had a hissy fit about it [22:39] Robots.txt does not mean shit, when I think blocking I think of the shit google and yahoo do [22:40] Deviantart, Inkbunny, SoFurry, Weasyl and Nabyn all allow bots in. Better to let them index the site, then you can find stuff with internet searches. 
[22:40] grepped over http://pouet.net/groups.php?pattern=[a-z]
[22:41] got 61243 prod-ids
[22:41] where can i paste it?
[22:41] paste.archivingyoursh.it or pad.archivingyoursh.it
[22:45] http://paste.archivingyoursh.it/wadarihewe.md
[22:47] for id in $(cat ./list_of_ids); do wget "...${id}..."; done
[22:47] except I'm unsure the exact wget we need.
[22:48] Oh wait that's line numbers ¬_¬
[22:58] i found 11139 groups: http://paste.archivingyoursh.it/nexamejoyi.apache
[23:10] http://archive.org/details/DOS.Memories.Project.1980-2003
[23:13] http://ia601705.us.archive.org/zipview.php?zip=/7/items/DOS.Memories.Project.1980-2003/DOS.Memories.Project.1980-2003.zip
[23:16] https://archive.org/details/oldies-but-goldies-1740-games https://archive.org/details/Nextys_Archive https://archive.org/details/PC98_Games_1813
[23:16] zipview still doesn't like Japanese filenames though :(
[23:18] more to come too, I have a 7.75GB Home of the Underdogs set and a bunch more .jp stuff
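A runnable version of the loop sketched at 22:47, filled in with the download.php pattern from 22:13. A sketch only: it assumes list_of_ids holds one prod ID per line (e.g. the 61243 IDs pasted above):

    # fetch each prod's download, keeping the server-supplied filename
    while read -r id; do
        wget --content-disposition --wait=1 \
            "http://pouet.net/download.php?which=$id"
    done < list_of_ids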