#archiveteam 2013-06-15,Sat

Time Nickname Message
02:32 ivan` having trouble editing the wiki, getting 508
04:12 SketchCow It's working again.
04:58 SketchCow I AM UPLOADING SO MUCH MANGA
04:58 SketchCow alard: When you have a chance, I think the Xanga problem needs your help #xenga
04:58 SketchCow I mean #jenga
05:02 godane I'm uploading Science News Weekly
05:02 godane it's a failed twit show
05:04 godane example url: https://archive.org/details/Science_News_Weekly_1
05:05 godane SketchCow: at some point we will need a twit sub-collection to put all my twit collections into
05:05 SketchCow Yeah.
05:05 godane if only to say here are the twit shows
05:06 godane it would also cut the collection number by like half in computers and tech videos
05:33 godane look what i found: https://www.tadaa.net/Lsjourney
05:35 SketchCow http://bitsavers.informatik.uni-stuttgart.de/bits/Walnut_Creek_CDROM/ fuck yeah
05:37 pft hahahaha that's awesome
05:45 godane looks like IA hit lsjourney.com on june 11 like 6 times
06:02 godane anyways all science news weekly is uploaded
06:02 godane there were only 16 episodes of it
07:23 SketchCow Just passed 1000 issues of manga
09:13 ivan` making a bz2 of all the URLs in common_crawl_index for convenient grepping with lbzip2
09:14 ivan` will take 6 days, sadly
09:14 ivan` ~437GB -> ~50GB
10:14 antomatic Nice!
10:21 ivan` it's only useful if you want to search for something in the /path, common_crawl_index is good at getting URLs for a domain
10:21 ivan` IA has at least 10x as many URLs though
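A rough sketch of the kind of streaming search this enables, with a hypothetical file name and search string; Python 3's bz2 module can read the multi-stream files that lbzip2 produces:

```python
import bz2
import sys

def grep_bz2(path, needle):
    """Stream a bz2-compressed URL list and print matching lines,
    without decompressing the whole file to disk."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                sys.stdout.write(line)

if __name__ == "__main__":
    # both arguments are hypothetical examples
    grep_bz2("common_crawl_urls.txt.bz2", "/wp-content/uploads/")
```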
12:14 zenguy_pc is there a tool like redditlog.com but for users' comments?
12:14 zenguy_pc i'd like to archive my comments and other comments
12:34 ivan` zenguy_pc: I typically use AutoPager + Mozilla Archive Format for that kind of thing
13:02 zenguy_pc ivan`: thanks .. will try it out
13:05 ivan` for your AutoPager rule on reddit.com/user/USERNAME/, use Link XPath //a[@rel='nofollow next'] and Content XPath //div[@id='siteTable']
13:07 ivan` if you need to back up many users though, I'd go with writing a wget-lua script
13:24 omf_ or just use the API
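A minimal sketch of the API route omf_ suggests, paging through a user's comment listing via reddit's public JSON endpoint; the username and output handling are placeholders, not anyone's actual script:

```python
import json
import time
import urllib.request

def fetch_user_comments(username):
    """Walk a reddit user's comment listing via the public JSON API, 100 items per page."""
    comments, after = [], None
    headers = {"User-Agent": "comment-archiver/0.1 (personal backup)"}
    while True:
        url = "https://www.reddit.com/user/%s/comments/.json?limit=100" % username
        if after:
            url += "&after=" + after
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)["data"]
        comments.extend(child["data"] for child in data["children"])
        after = data.get("after")
        if not after:
            break
        time.sleep(2)  # be polite to the API
    return comments

if __name__ == "__main__":
    # "USERNAME" is a placeholder
    print(len(fetch_user_comments("USERNAME")))
```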
14:25 * Smiley looks in
15:52 omf_ WiK, Thinking back, I realize I need to give you a better explanation. I know I can search the gitdigger data on my own instead of just submitting a pull request. I do the pull request so others who happen upon your work might use it if more of it is pre-chewed
17:35 Tephra GLaDOS: wow, look at that: ert.gr is now under construction again! with gif and all!
20:31 sydbarret Hi all, I'm trying to help archive Posterous, and when I go to the admin page (localhost:8001), the status message is, 'No item received. Retrying after 30 seconds...' It keeps displaying that ad nauseam.
20:33 wp494 syd: that's because all of the available items have already been assigned to someone
20:34 wp494 meaning that there's nothing more that the tracker can give
20:34 wp494 if you're running the AT choice, you're better off running the formspring project or the URLteam project
20:35 wp494 (speaking of AT choice, could we get that set to formspring since we're finishing posterous?)
20:36 sydbarret ok cool
20:37 sydbarret alright, it's working now...sorry if that was a silly question
20:38 sydbarret i was thinking perhaps i had an incorrect firewall setting or something
22:27 Wyatts Smiley reminded me recently that I should come here and surrender the 150GB of independently-produced, copyright-infringing, mostly-awful Japanese porno-comics that I...er, "happened upon".
22:27 Smiley \o \o
22:27 Smiley Wyatts: want to just upload them to IA...
22:28 Smiley fastest way if you already have the scans in a sensible format :)
22:28 Smiley I can point SketchCow to move them somewhere sensible once you're done.
22:28 Smiley Or if you want to "put" them somewhere, I'll grab them and upload em
22:28 Wyatts Well that's part of what I was wondering about. 107G is in some sort of archive format.
22:28 Smiley haha, we deal in weird formats.
22:29 Wyatts The rest is crawled from a weird site.
22:29 Smiley and weird sites :D
22:29 DFJustin .cbz and .cbr are supported by IA
22:29 DFJustin so zip and rar can be trivially renamed to that, and folders of images can be zipped up
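A small sketch of what DFJustin describes, with hypothetical paths: a .cbz is just a zip of page images, so an existing zip only needs renaming and a folder can be packed directly:

```python
import os
import zipfile

def folder_to_cbz(book_dir, out_path):
    """Pack a directory of page images into a .cbz (a plain zip with a new extension)."""
    # ZIP_STORED: JPEGs are already compressed, so skip recompressing them
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_STORED) as zf:
        for name in sorted(os.listdir(book_dir)):
            if name.lower().endswith((".jpg", ".jpeg", ".png", ".gif")):
                zf.write(os.path.join(book_dir, name), arcname=name)

# an existing zip just gets renamed:
# os.rename("some_book.zip", "some_book.cbz")
```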
22:30 Wyatts So it's a...pretty rough set of trees of directories of landing pages and thumbnails. It's kind of awkward. (Also all Shift-JIS)
22:30 DFJustin SketchCow has a script uploading a bunch of scanlations to https://archive.org/details/manga_library
22:30 Smiley Wyatts: make a tarball, link me to it, I'll sort the rest?
22:31 Wyatts Smiley: It's 40G. That'll take weeks just to upload. I don't have any problems with hammering it all into some more-useful format. It's just a question of what that should be, I guess.
22:32 Wyatts (And there's lots more where that came from if there's some way of automating Mediafire downloads en masse)
22:32 Smiley Is the original still on mediafire?
22:34 Smiley OOOOOOOOO YEY my WARCs are appearing in wayback (ign/gamespy)
22:35 Wyatts Okay, let me clarify. There are two sites. One of them was one I had to sort of....do some custom crawling on. That gave me a 40GB tree of sites with divisions, landing pages, and subdirectories of images. That's all SJIS.
22:35 Wyatts The other was just straight HTTP links and, after deduplication, ended up being 107G of...looks like mostly regular zips.
22:38 Wyatts The latter also has hundreds or thousands of links to mediafire pages with more zips. I was going to grab those too, but I couldn't figure out how to automate it.
22:39 Wyatts And then life happened and it all got backburnered.
22:42 Smiley Ah ok awesome
22:42 * Smiley ponders
22:42 Smiley So really two steps
22:42 Smiley 1. get the existing content out and uploaded
22:42 Wyatts The regular zips, sure, if it's fine to do so, can just get pushed into IA.
22:42 Smiley 2. get those links.
22:42 Smiley SJIS...... i need to check what this is D:
22:43 Wyatts Shift-JIS
22:43 DFJustin are the filenames in sjis or just contents of html
22:43 Smiley Wyatts: I'd personally just hopefully find -name *.jpeg .... and just tar em up
22:43 Smiley or zip, as DFJustin said.
22:44 Smiley Instead of uploading 40GB, can you use that crazy compression that I've heard about and just reduce it massively, as everyone has spare CPU? XD
22:50 Wyatts DFJustin: Just the page content, but if I were to zip them up, giving them at least titles would be useful. I already extracted all the titles and ran them through iconv so the encoding is more of a footnote.
22:50 DFJustin ok great
22:51 DFJustin ia doesn't like unicode filenames so the title would be more of a metadata thing
22:52 DFJustin have a look at this https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
22:52 Smiley ^_^ this is why I offered to upload for you, I have scripts to help.
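A rough sketch of the S3-style upload the bulk uploader wraps, with the title passed as item metadata rather than as a filename; the identifier, file, and keys are placeholders, and the uri(...) wrapping for non-ASCII metadata values follows IA's S3 documentation:

```python
import os
import urllib.parse
import urllib.request

IA_ACCESS = "YOUR_ACCESS_KEY"   # from https://archive.org/account/s3.php
IA_SECRET = "YOUR_SECRET_KEY"

def ia_upload(identifier, filepath, metadata):
    """PUT one file into an archive.org item, passing metadata as x-archive-meta-* headers."""
    url = "https://s3.us.archive.org/%s/%s" % (identifier, os.path.basename(filepath))
    with open(filepath, "rb") as fh:
        req = urllib.request.Request(url, data=fh.read(), method="PUT")
    req.add_header("authorization", "LOW %s:%s" % (IA_ACCESS, IA_SECRET))
    req.add_header("x-archive-auto-make-bucket", "1")
    for key, value in metadata.items():
        # percent-encode inside uri(...) so unicode titles survive the ASCII-only headers
        req.add_header("x-archive-meta-" + key, "uri(%s)" % urllib.parse.quote(value))
    with urllib.request.urlopen(req) as resp:
        return resp.status

# hypothetical usage:
# ia_upload("doujing16-meg2", "meg2.cbz", {"mediatype": "texts", "title": "meg2"})
```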
22:52 Wyatts Here, more context: http://doujing.livedoor.biz. You have to go to the archive posts in the sidebar, and then you get links to subdomains of artemisweb.jp, with landing pages of thumbnails like so: http://s6.artemisweb.jp/doujing16/meg2/
22:52 Wyatts As if it wasn't clear already, NSFW
22:57 Wyatts And yeah, Smiley, there are literally thousands of 001.jpg, so it's not a simple find job. Considering the structure, that's why I was thinking it might be better to just individually zip each book.
22:58 DFJustin yeah just zip each book and rename to .cbz and you're good to go
22:59 Wyatts And the naming?
23:02 DFJustin well for the doujing ones use the directory name like meg2 I guess
23:02 DFJustin it doesn't really matter as long as the proper details are filled in as metadata
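A tentative sketch of that per-book pass, assuming each book sits in its own image directory (like .../doujing16/meg2/): every such directory becomes <dirname>.cbz and the extracted titles are kept aside for the item metadata; the paths and the titles mapping are hypothetical:

```python
import csv
import os
import shutil

def pack_books(crawl_root, out_dir, titles):
    """Turn every image directory under crawl_root into out_dir/<dirname>.cbz,
    recording a human-readable title (from the landing pages) for metadata."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "metadata.csv"), "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["identifier", "title"])
        for dirpath, dirnames, filenames in os.walk(crawl_root):
            if not any(f.lower().endswith((".jpg", ".jpeg", ".png")) for f in filenames):
                continue  # not a book directory
            book = os.path.basename(dirpath)                      # e.g. "meg2"
            archive = shutil.make_archive(os.path.join(out_dir, book), "zip", dirpath)
            os.rename(archive, os.path.join(out_dir, book + ".cbz"))
            writer.writerow([book, titles.get(book, book)])

# pack_books("crawl/s6.artemisweb.jp", "cbz_out", titles={"meg2": "..."})
```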
23:04 Wyatts All right, I'll have to look at munging that into the proper format for automating. And you said IA doesn't do unicode filenames? That...may be slightly problematic for the other ones.
23:05 Wyatts (同人誌) [おたべ★ダイナマイツ(おたべさくら)] 魔法風俗デリヘル★マギカ 1 (魔法少女まどか☆マギカ).zip ← Representative filename
23:07 Wyatts Lots of metadata in these filenames, too. Type of work, circle, artist, title, series. Some of them have extra tags before the extension.
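A tentative sketch of pulling those fields back out of such names; the regex is an assumption based on the one representative filename above (type in parentheses, circle and artist in brackets, then title and series) and would need adjusting for files with extra trailing tags:

```python
import re

# assumed layout: (worktype) [circle(artist)] title (series) [optional tags].zip
FILENAME_RE = re.compile(
    r"^\((?P<worktype>[^)]+)\)\s*"
    r"\[(?P<circle>[^(\]]+)(?:\((?P<artist>[^)]*)\))?\]\s*"
    r"(?P<title>.+?)"
    r"(?:\s*\((?P<series>[^)]*)\))?"
    r"(?:\s*\[[^\]]*\])*"
    r"\.zip$"
)

def parse_name(filename):
    m = FILENAME_RE.match(filename)
    return m.groupdict() if m else None

# hypothetical example in the same shape as the filename quoted above:
print(parse_name("(同人誌) [SomeCircle(SomeArtist)] Some Title 1 (Some Series).zip"))
```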
23:12 Wyatts NSFW, http://s6.artemisweb.jp/doujing4/e38/02.html Wow, was trying to gauge how old some of these are and found this.
23:19 SketchCow All right, time to download crazy Pleasuredome stuff
23:30 godane more episodes of TWiCH are getting uploaded
23:38 godane I'M AWESOME
23:39 godane i may be able to find the lost 10 episodes of twit live specials
23:40 godane the files were called like ces0001 before
