[02:32] having trouble editing the wiki, getting 508
[04:12] It's working again.
[04:58] I AM UPLOADING SO MUCH MANGA
[04:58] alard: When you have a chance, I think the Xanga problem needs your help #xenga
[04:58] I mean #jenga
[05:02] I'm uploading Science News Weekly
[05:02] it's a failed TWiT show
[05:04] example url: https://archive.org/details/Science_News_Weekly_1
[05:05] SketchCow: at some point we will need a TWiT sub-collection to put all my TWiT collections into
[05:05] Yeah.
[05:05] if only to say: here are the TWiT shows
[05:06] it would also cut the collection count in computers and tech videos by about half
[05:33] look what i found: https://www.tadaa.net/Lsjourney
[05:35] http://bitsavers.informatik.uni-stuttgart.de/bits/Walnut_Creek_CDROM/ fuck yeah
[05:37] hahahaha that's awesome
[05:45] IA looks like it hit lsjourney.com on June 11 like 6 times
[06:02] anyway, all of Science News Weekly is uploaded
[06:02] there were only 16 episodes of it
[07:23] Just passed 1000 issues of manga
[09:13] making a bz2 (with lbzip2) of all the URLs in common_crawl_index, for convenient grepping
[09:14] will take 6 days, sadly
[09:14] ~437GB -> ~50GB
[10:14] Nice!
[10:21] it's only useful if you want to search for something in the /path; common_crawl_index is already good at getting URLs for a domain
[10:21] IA has at least 10x as many URLs though
[12:14] is there a tool like redditlog.com but for users' comments?
[12:14] i'd like to archive my comments and other comments
[12:34] zenguy_pc: I typically use AutoPager + Mozilla Archive Format for that kind of thing
[13:02] ivan`: thanks .. will try it out
[13:05] for your AutoPager rule on reddit.com/user/USERNAME/, use Link XPath //a[@rel='nofollow next'] and Content XPath //div[@id='siteTable']
[13:07] if you need to back up many users though, I'd go with writing a wget-lua script
[13:24] or just use the API
[14:25] * Smiley looks in
[15:52] WiK: thinking back, I realize I need to give you a better explanation. I know I can search the gitdigger data on my own instead of just submitting a pull request. I do the pull request so others who happen upon your work might use it if more of it is pre-chewed.
[17:35] GLaDOS: wow, look at that: ert.gr is now under construction again! with a gif and all!
[20:31] Hi all, I'm trying to help archive Posterous, and when I go to the admin page (localhost:8001), the status message is 'No item received. Retrying after 30 seconds...' It keeps displaying that ad nauseam.
[20:33] syd: that's because all of the available items have already been assigned to someone
[20:34] meaning that there's nothing more that the tracker can give
[20:34] if you're running the AT choice, you're better off running the formspring project or the URLTeam project
[20:35] (speaking of the AT choice, could we get that set to formspring, since we're finishing posterous?)
[20:36] ok cool
[20:37] alright, it's working now... sorry if that was a silly question
[20:38] i was thinking perhaps i had an incorrect firewall setting or something
[22:27] Smiley reminded me recently that I should come here and surrender the 150GB of independently-produced, copyright-infringing, mostly-awful Japanese porno-comics that I... er, "happened upon".
[22:27] \o \o
[22:27] Wyatts: want to just upload them to IA...
[22:28] fastest way if you already have the scans in a sensible format :)
[22:28] I can point SketchCow to move them somewhere sensible once you're done.
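A minimal sketch of the "just use the API" route mentioned at [13:24] for backing up one reddit user's comments, assuming reddit's public JSON listing endpoint (reddit.com/user/<name>/comments.json) and its "after" pagination token; USERNAME, the User-Agent string, and the output filename are placeholders, not anything from the channel.

    # Hedged sketch: archive one reddit user's comments via the public JSON
    # listing API (reddit.com/user/<name>/comments.json, paginated with the
    # "after" token). USERNAME and the output path are placeholders.
    import json
    import time
    import urllib.request

    USERNAME = "someuser"  # placeholder

    def fetch_comments(username):
        comments, after = [], None
        while True:
            url = f"https://www.reddit.com/user/{username}/comments.json?limit=100"
            if after:
                url += f"&after={after}"
            req = urllib.request.Request(url, headers={"User-Agent": "comment-archiver-sketch/0.1"})
            with urllib.request.urlopen(req) as resp:
                listing = json.load(resp)["data"]
            comments.extend(child["data"] for child in listing["children"])
            after = listing.get("after")
            if not after:      # no more pages
                break
            time.sleep(2)      # be polite to the API
        return comments

    if __name__ == "__main__":
        with open(f"{USERNAME}_comments.json", "w") as f:
            json.dump(fetch_comments(USERNAME), f, indent=2)

For backing up many users, the wget-lua route suggested at [13:07] is the better fit, since it keeps the raw responses rather than just the parsed JSON.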
[22:28] Or if you want to "put" them somewhere, I'll grab them and upload 'em
[22:28] Well, that's part of what I was wondering about. 107G is in some sort of archive format.
[22:28] haha, we deal in weird formats.
[22:29] The rest is crawled from a weird site.
[22:29] and weird sites :D
[22:29] .cbz and .cbr are supported by IA
[22:29] so zip and rar can be trivially renamed to that, and folders of images can be zipped up
[22:30] So it's a... pretty rough set of trees of directories of landing pages and thumbnails. It's kind of awkward. (Also all Shift-JIS)
[22:30] SketchCow has a script uploading a bunch of scanlations to https://archive.org/details/manga_library
[22:30] Wyatts: make a tarball, link me to it, I'll sort the rest?
[22:31] Smiley: It's 40G. That'll take weeks just to upload. I don't have any problem with hammering it all into some more useful format; it's just a question of what that should be, I guess.
[22:32] (And there's lots more where that came from, if there's some way of automating Mediafire downloads en masse)
[22:32] Is the original still on Mediafire?
[22:34] OOOOOOOOO YEY my WARCs are appearing in the Wayback Machine (ign/gamespy)
[22:35] Okay, let me clarify. There are two sites. One of them I had to do some custom crawling on. That gave me a 40GB tree of sites with divisions, landing pages, and subdirectories of images. That's all SJIS.
[22:35] The other was just straight HTTP links and, after deduplication, ended up being 107G of... looks like mostly regular zips.
[22:38] The latter also has hundreds or thousands of links to Mediafire pages with more zips. I was going to grab those too, but I couldn't figure out how to automate it.
[22:39] And then life happened and it all got backburnered.
[22:42] Ah ok awesome
[22:42] * Smiley ponders
[22:42] So really two steps
[22:42] 1. get the existing content out and uploaded
[22:42] The regular zips, sure, if it's fine to do so, can just get pushed into IA.
[22:42] 2. get those links.
[22:42] SJIS...... i need to check what this is D:
[22:43] Shift-JIS
[22:43] are the filenames in SJIS, or just the contents of the HTML?
[22:43] Wyatts: I'd personally just do a find -name '*.jpeg' ... and tar 'em up
[22:43] or zip, as DFJustin said.
[22:44] Instead of uploading 40GB, can you use that crazy compression that I've heard about and just reduce it massively, since everyone has spare CPU? XD
[22:50] DFJustin: Just the page content, but if I were to zip them up, giving them at least titles would be useful. I already extracted all the titles and ran them through iconv, so the encoding is more of a footnote.
[22:50] ok great
[22:51] IA doesn't like unicode filenames, so the title would be more of a metadata thing
[22:52] have a look at this: https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
[22:52] ^_^ this is why I offered to upload for you; I have scripts to help.
[22:52] Here, more context: http://doujing.livedoor.biz. You have to go to the archive posts in the sidebar, and then you get links to subdomains of artemisweb.jp, with landing pages of thumbnails like so: http://s6.artemisweb.jp/doujing16/meg2/
[22:52] As if it wasn't clear already, NSFW
[22:57] And yeah, Smiley, there are literally thousands of 001.jpg, so it's not a simple find job. Considering the structure, that's why I was thinking it might be better to just individually zip each book.
[22:58] yeah, just zip each book, rename to .cbz, and you're good to go
[22:59] And the naming?
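A minimal sketch of the "zip each book and rename to .cbz" step from [22:57]-[22:58], assuming each book is one directory of numbered page images somewhere under the crawled tree; BOOKS_ROOT and OUT_DIR are placeholder paths, and thumbnail/landing-page filtering is only roughly handled.

    # Hedged sketch: package each per-book directory of page images as a .cbz
    # (a plain zip with a .cbz extension), named after its directory as
    # discussed for the doujing/artemisweb tree. Paths are placeholders.
    import os
    import zipfile

    BOOKS_ROOT = "doujing_crawl"   # placeholder: root of the crawled tree
    OUT_DIR = "cbz_out"            # placeholder: where the .cbz files go
    IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")

    os.makedirs(OUT_DIR, exist_ok=True)

    for dirpath, dirnames, filenames in os.walk(BOOKS_ROOT):
        pages = sorted(f for f in filenames if f.lower().endswith(IMAGE_EXTS))
        if not pages:
            continue  # skip directories with no images (e.g. bare landing pages)
        book_name = os.path.basename(dirpath)  # e.g. "meg2"
        cbz_path = os.path.join(OUT_DIR, book_name + ".cbz")
        with zipfile.ZipFile(cbz_path, "w", zipfile.ZIP_STORED) as cbz:
            for page in pages:
                # store pages flat inside the archive, in filename order
                cbz.write(os.path.join(dirpath, page), arcname=page)
        print("wrote", cbz_path, f"({len(pages)} pages)")

ZIP_STORED avoids recompressing already-compressed JPEGs; a real run would also want to prefix the output name with the parent path, since duplicate directory names anywhere in the tree would otherwise collide.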
[23:02] well, for the doujing ones, use the directory name like meg2, I guess
[23:02] it doesn't really matter as long as the proper details are filled in as metadata
[23:04] All right, I'll have to look at munging that into the proper format for automating. And you said IA doesn't do unicode filenames? That... may be slightly problematic for the other ones.
[23:05] (同人誌) [おたべ★ダイナマイツ(おたべさくら)] 魔法風俗デリヘル★マギカ 1 (魔法少女まどか☆マギカ).zip ← representative filename
[23:07] Lots of metadata in these filenames, too: type of work, circle, artist, title, series. Some of them have extra tags before the extension.
[23:12] NSFW, http://s6.artemisweb.jp/doujing4/e38/02.html Wow, was trying to gauge how old some of these are and found this.
[23:19] All right, time to download crazy Pleasuredome stuff
[23:30] more episodes of TWiCH are getting uploaded
[23:38] I'M AWESOME
[23:39] i may be able to find the 10 lost episodes of TWiT Live Specials
[23:40] the files were called something like ces0001 before
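A rough sketch of pulling the metadata out of filenames like the one at [23:05], assuming the "(type of work) [circle (artist)] title (series).zip" pattern described at [23:07] holds; the regex and field names are guesses at that convention, and the "extra tags before the extension" case is not handled.

    # Hedged sketch: split "(type) [circle(artist)] title (series).zip" style
    # filenames into metadata fields, per the convention described above.
    # The regex is an assumption about the naming, not an established standard.
    import re

    FILENAME_RE = re.compile(
        r"^\((?P<worktype>[^)]+)\)\s*"       # (同人誌) "doujinshi" -> type of work
        r"\[(?P<circle>[^(\]]+)"             # [circle
        r"(?:\((?P<artist>[^)]+)\))?\]\s*"   # (artist)] -- artist is optional
        r"(?P<title>.+?)\s*"                 # title
        r"(?:\((?P<series>[^)]+)\))?\s*"     # (series) -- optional
        r"\.zip$"
    )

    def parse_name(filename):
        m = FILENAME_RE.match(filename)
        return m.groupdict() if m else None

    example = "(同人誌) [おたべ★ダイナマイツ(おたべさくら)] 魔法風俗デリヘル★マギカ 1 (魔法少女まどか☆マギカ).zip"
    print(parse_name(example))

Since IA is unhappy with unicode filenames ([22:51]), the parsed fields would go into item metadata while the uploaded file itself gets an ASCII-safe name, e.g. the directory-based naming suggested at [23:02].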