Time | Nickname | Message
02:32 | ivan` | having trouble editing the wiki, getting 508
04:12 | SketchCow | It's working again.
04:58 | SketchCow | I AM UPLOADING SO MUCH MANGA
04:58 | SketchCow | alard: When you have a chance, I think the Xanga problem needs your help #xenga
04:58 | SketchCow | I mean #jenga
05:02 | godane | I'm uploading Science News Weekly
05:02 | godane | it's a failed twit show
05:04 | godane | example url: https://archive.org/details/Science_News_Weekly_1
05:05 | godane | SketchCow: at some point we will need a twit sub-collection to put all my twit collections into
05:05 | SketchCow | Yeah.
05:05 | godane | if only to say here are the twit shows
05:06 | godane | it would also cut the collection number by like half in computers and tech videos
05:33 | godane | look what i found: https://www.tadaa.net/Lsjourney
05:35 | SketchCow | http://bitsavers.informatik.uni-stuttgart.de/bits/Walnut_Creek_CDROM/ fuck yeah
05:37 | pft | hahahaha that's awesome
05:45 | godane | looks like IA hit lsjourney.com on june 11 like 6 times
06:02 | godane | anyways all of Science News Weekly is uploaded
06:02 | godane | there were only 16 episodes of it
07:23 | SketchCow | Just passed 1000 issues of manga
09:13 | ivan` | making a bz2 of all the URLs in common_crawl_index for convenient grepping with lbzip2
09:14 | ivan` | will take 6 days, sadly
09:14 | ivan` | ~437GB -> ~50GB
10:14 | antomatic | Nice!
10:21 | ivan` | it's only useful if you want to search for something in the /path; common_crawl_index is good at getting URLs for a domain
10:21 | ivan` | IA has at least 10x as many URLs though
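A rough sketch of the kind of path search ivan` describes, streaming the compressed URL list instead of decompressing it to disk; the file name and the path fragment are placeholders, not anything taken from the log:

    # Stream-search a bzip2-compressed URL list for a path substring.
    # File name and search string are illustrative placeholders.
    import bz2
    import sys

    PATTERN = "/wp-content/uploads/"   # hypothetical path fragment to look for
    with bz2.open("common_crawl_urls.bz2", "rt", encoding="utf-8", errors="replace") as urls:
        for line in urls:              # decompressed on the fly, one URL per line
            if PATTERN in line:
                sys.stdout.write(line)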
12:14 | zenguy_pc | is there a tool like redditlog.com but for users' comments?
12:14 | zenguy_pc | i'd like to archive my comments and other comments
12:34 | ivan` | zenguy_pc: I typically use AutoPager + Mozilla Archive Format for that kind of thing
13:02 | zenguy_pc | ivan`: thanks .. will try it out
13:05 | ivan` | for your AutoPager rule on reddit.com/user/USERNAME/, use Link XPath //a[@rel='nofollow next'] and Content XPath //div[@id='siteTable']
13:05 | ivan` | if you need to back up many users though, I'd go with writing a wget-lua script
13:07 | omf_ | or just use the API
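omf_ presumably means reddit's public listing API; a minimal, hedged sketch of paging through one user's comments via the /user/<name>/comments.json endpoint (the username, output file, and delay are assumptions, and reddit's listing depth limits still apply):

    # Fetch a reddit user's comments via the public JSON listing API.
    # USERNAME and the output file are placeholders; reddit caps listings
    # at roughly 1000 items, so this is not a full historical backup.
    import json
    import time
    import urllib.request

    USERNAME = "someuser"
    after = None
    with open(f"{USERNAME}_comments.ndjson", "w", encoding="utf-8") as out:
        while True:
            url = f"https://www.reddit.com/user/{USERNAME}/comments.json?limit=100"
            if after:
                url += f"&after={after}"
            req = urllib.request.Request(url, headers={"User-Agent": "archive-sketch/0.1"})
            page = json.load(urllib.request.urlopen(req))["data"]
            for child in page["children"]:
                out.write(json.dumps(child["data"]) + "\n")   # one comment per line
            after = page["after"]
            if not after:                                     # no more pages
                break
            time.sleep(2)                                     # be gentle with the API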
14:25
๐
|
* |
Smiley looks in |
15:52
๐
|
omf_ |
WiK, Thinking back I realize I need to give you a better explanation. I know I can search the gitdigger data on my own instead of just submitting a pull request. I do the pull request so others who happen upon your work might use it if more of it is pre-chewed |
17:35
๐
|
Tephra |
GLaDOS: wow look at that ert.gr is now under construction again! with gif and all! |
20:31
๐
|
sydbarret |
Hi all, I'm trying to help archive Posterous, and when I go to the admin page (localhost:8001), the status message is, 'No item received. Retrying after 30 seconds...' It keeps displaying that ad nauseam. |
20:33
๐
|
wp494 |
syd: that's because all of the available items have already been assigned to someone |
20:34
๐
|
wp494 |
meaning that there's nothing more that the tracker can give |
20:34
๐
|
wp494 |
if you're running the AT choice, you're better off running the formspring project or the URLteam project |
20:35
๐
|
wp494 |
(speaking of AT choice, could we get that set to formspring since we're finishing posterous)? |
20:36
๐
|
sydbarret |
ok cool |
20:37
๐
|
sydbarret |
alright, it's working now...sorry if that was a silly question |
20:38
๐
|
sydbarret |
i was thinking perhaps i had an incorrect firewall setting or something |
22:27
๐
|
Wyatts |
Smiley reminded me recently that I should come here and surrender the 150GB of independently-produced, copyright-infringing, mostly-awful Japanese porno-comics that I...er, "happened upon". |
22:27 | Smiley | \o \o
22:27 | Smiley | Wyatts: want to just upload them to IA...
22:27 | Smiley | fastest way if you already have the scans in a sensible format :)
22:28 | Smiley | I can point SketchCow to move them somewhere sensible once you're done.
22:28 | Smiley | Or if you want to "put" them somewhere, I'll grab them and upload em
22:28 | Wyatts | Well that's part of what I was wondering about. 107G is in some sort of archive format.
22:28 | Smiley | haha, we deal in weird formats.
22:29 | Wyatts | The rest is crawled from a weird site.
22:29 | Smiley | and weird sites :D
22:29 | DFJustin | .cbz and .cbr are supported by IA
22:29 | DFJustin | so zip and rar can be trivially renamed to that, and folders of images can be zipped up
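A minimal sketch of what DFJustin describes for the "folders of images" case: zip each book directory and give the archive a .cbz extension. The directory layout (one subdirectory per book) and all names are assumptions:

    # Pack every book directory under BOOKS_DIR into <dirname>.cbz.
    # BOOKS_DIR is a placeholder for wherever the image folders live.
    import os
    import zipfile

    BOOKS_DIR = "books"                # hypothetical: one subdirectory per book
    for book in sorted(os.listdir(BOOKS_DIR)):
        src = os.path.join(BOOKS_DIR, book)
        if not os.path.isdir(src):
            continue
        # A .cbz is just a zip of the page images; JPEGs don't recompress,
        # so store them uncompressed.
        with zipfile.ZipFile(book + ".cbz", "w", zipfile.ZIP_STORED) as cbz:
            for name in sorted(os.listdir(src)):
                if name.lower().endswith((".jpg", ".jpeg", ".png", ".gif")):
                    cbz.write(os.path.join(src, name), arcname=name)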
22:30 | Wyatts | So it's a... pretty rough set of trees of directories of landing pages and thumbnails. It's kind of awkward. (Also all Shift-JIS)
22:30 | DFJustin | SketchCow has a script uploading a bunch of scanlations to https://archive.org/details/manga_library
22:30 | Smiley | Wyatts: make a tarball, link me to it, I'll sort the rest?
22:31 | Wyatts | Smiley: It's 40G. That'll take weeks just to upload. I don't have any problems with hammering it all into some more-useful format. It's just a question of what that should be, I guess.
22:32 | Wyatts | (And there's lots more where that came from if there's some way of automating Mediafire downloads en masse)
22:32 | Smiley | Is the original still on mediafire?
22:34 | Smiley | OOOOOOOOO YEY my WARCs are appearing in wayback (ign/gamespy)
22:35 | Wyatts | Okay, let me clarify. There are two sites. One of them was one I had to sort of... do some custom crawling on. That gave me a 40GB tree of sites with divisions, landing pages, and subdirectories of images. That's all SJIS.
22:35 | Wyatts | The other was just straight HTTP links and, after deduplication, ended up being 107G of... looks like mostly regular zips.
22:38 | Wyatts | The latter also has hundreds or thousands of links to mediafire pages with more zips. I was going to grab those too, but I couldn't figure out how to automate it.
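Harvesting the links themselves (as opposed to automating the Mediafire downloads) is easy to script; a sketch that pulls mediafire.com URLs out of a local crawl directory, with the directory name and regex as assumptions:

    # Collect unique mediafire.com links from a tree of crawled HTML files.
    # CRAWL_DIR is a placeholder; the regex is deliberately loose.
    import os
    import re

    CRAWL_DIR = "crawl"
    LINK_RE = re.compile(rb'https?://(?:www\.)?mediafire\.com/[^\s"\'<>]+')
    found = set()
    for root, _dirs, files in os.walk(CRAWL_DIR):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                with open(os.path.join(root, name), "rb") as page:
                    found.update(LINK_RE.findall(page.read()))
    for link in sorted(found):
        print(link.decode("ascii", "replace"))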
22:39 | Wyatts | And then life happened and it all got backburnered.
22:42 | Smiley | Ah ok awesome
22:42 | * | Smiley ponders
22:42 | Smiley | So really two setsp
22:42 | Smiley | steps
22:42 | Smiley | 1. get the existing content out and uploaded
22:42 | Wyatts | The regular zips, sure, if it's fine to do so, can just get pushed into IA.
22:42 | Smiley | 2. get those links.
22:43 | Smiley | SJIS...... i need to check what this is D:
22:43 | Wyatts | Shift-JIS
22:43 | DFJustin | are the filenames in sjis or just contents of html
22:43 | Smiley | Wyatts: I'd personally just hopefully find -name *.jpeg .... and just tar em up
22:44 | Smiley | or zip, as DFJustin said.
22:50 | Smiley | Instead of uploading 40GB, can you use that crazy compression that I've heard about and just reduce it massively, as everyone has spare CPU? XD
22:50 | Wyatts | DFJustin: Just the page content, but if I were to zip them up, giving them at least titles would be useful. I already extracted all the titles and ran them through iconv so the encoding is more of a footnote.
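For reference, the Shift-JIS-to-UTF-8 conversion Wyatts says he did with iconv takes only a couple of lines in Python; the file names here are placeholders:

    # Re-encode a Shift-JIS text file (e.g. a list of extracted titles) as UTF-8,
    # roughly equivalent to: iconv -f SHIFT_JIS -t UTF-8 titles.sjis.txt > titles.utf8.txt
    with open("titles.sjis.txt", "r", encoding="shift_jis", errors="replace") as src, \
         open("titles.utf8.txt", "w", encoding="utf-8") as dst:
        dst.write(src.read())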
22:50 | DFJustin | ok great
22:51 | DFJustin | ia doesn't like unicode filenames so the title would be more of a metadata thing
22:52 | DFJustin | have a look at this https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
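The bulk uploader on that wiki page drives IA's S3-like (IAS3) interface; a hedged single-file sketch of the same idea, where the identifier, file name, collection, and keys are all placeholders and the x-archive-meta-* headers carry the metadata DFJustin mentions:

    # Upload one .cbz to archive.org via the S3-like API (IAS3), setting
    # metadata through x-archive-meta-* headers. Identifier, file, keys,
    # and collection are placeholders, not values from the log.
    import urllib.request

    IDENTIFIER = "example-doujin-item"
    FILENAME = "meg2.cbz"
    ACCESS, SECRET = "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"

    with open(FILENAME, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        f"https://s3.us.archive.org/{IDENTIFIER}/{FILENAME}",
        data=body,
        method="PUT",
        headers={
            "authorization": f"LOW {ACCESS}:{SECRET}",
            "x-archive-auto-make-bucket": "1",         # create the item if it doesn't exist
            "x-archive-meta-mediatype": "texts",
            "x-archive-meta-collection": "opensource", # placeholder collection
            "x-archive-meta-title": "meg2",            # title lives in metadata, not the filename
        },
    )
    print(urllib.request.urlopen(req).status)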
22:52 | Smiley | ^_^ this is why I offered to upload for you, I have scripts to help.
22:52 | Wyatts | Here, more context: http://doujing.livedoor.biz. You have to go to the archive posts in the sidebar, and then you get links to subdomains of artemisweb.jp, with landing pages of thumbnails like so: http://s6.artemisweb.jp/doujing16/meg2/
22:52 | Wyatts | As if it wasn't clear already, NSFW
22:57 | Wyatts | And yeah, Smiley, there are literally thousands of 001.jpg, so it's not a simple find job. Considering the structure, that's why I was thinking it might be better to just individually zip each book.
22:58 | DFJustin | yeah just zip each book and rename to .cbz and you're good to go
22:59 | Wyatts | And the naming?
23:02 | DFJustin | well for the doujing ones use the directory name like meg2 I guess
23:02 | DFJustin | it doesn't really matter as long as the proper details are filled in as metadata
23:04 | Wyatts | All right, I'll have to look at munging that into the proper format for automating. And you said IA doesn't do unicode filenames? That... may be slightly problematic for the other ones.
23:05 | Wyatts | [garbled Japanese doujinshi filename].zip ← Representative filename
23:07 | Wyatts | Lots of metadata in these filenames, too. Type of work, circle, artist, title, series. Some of them have extra tags before the extension.
23:12 | Wyatts | NSFW, http://s6.artemisweb.jp/doujing4/e38/02.html Wow, was trying to gauge how old some of these are and found this.
23:19 | SketchCow | All right, time to download crazy Pleasuredome stuff
23:30 | godane | more episodes of TWiCH are getting uploaded
23:38 | godane | I'M AWESOME
23:39 | godane | i may be able to find the lost 10 episodes of twit live specials
23:40 | godane | the files were called like ces0001 before