#archiveteam 2011-11-10,Thu

Time	Nickname	Message
00:41 ^🔗	underscor	http://archiveofourown.org/works/258626
00:57 ^🔗	chronomex	yes, yes.
00:58 ^🔗	underscor	chronomex: No way, you're beating me!
00:58 ^🔗	underscor	:'(
02:08 ^🔗	underscor	Coderjoe flatlined
02:24 ^🔗	chronomex	underscor: how the hell do I beat you
02:31 ^🔗	underscor	chronomex: download faster
04:03 ^🔗	yipdw	hm, that's annoying: you can plug in anything for the username in http://developer.berlios.de/devlog/username, and you'll get back 200 OK
04:33 ^🔗	underscor	uh oh, this wget-warc is using all my memory :(
04:47 ^🔗	yipdw	I get the feeling that BerliOS developer logs are pretty sparse
04:47 ^🔗	yipdw	I've checked 1,058 users so far and found 2 real devlogs
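(Sketch, not from the channel: when a server answers 200 OK for any username, one possible workaround is to compare each page with the page returned for a name known not to exist. The function name, the 0.95 threshold and the use of difflib are assumptions for illustration; the log does not say how yipdw told real devlogs from empty ones.)

```python
from difflib import SequenceMatcher


def is_soft_404(body: str, baseline: str, threshold: float = 0.95) -> bool:
    """Return True if `body` is almost the same text as `baseline`.

    `baseline` is the page served for a username that certainly does not
    exist; a real devlog should differ from it by more than a few characters.
    """
    similarity = SequenceMatcher(None, body, baseline).ratio()
    return similarity >= threshold
```

The baseline would be fetched once and reused for every username checked; the threshold probably needs tuning, since the placeholder page may echo the requested name.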
04:54 ^🔗	yipdw	curses, Paradoks is beating me on mobileme by a gigabyte!
04:54 ^🔗	yipdw	BUT NOT FOR LONG
04:57 ^🔗	chronomex	yeah. right.
05:04 ^🔗	rude___	buh
05:48 ^🔗	Coderjoe	underscor: yeah... running out of disk space will do that
05:57 ^🔗	Coderjoe	I'm probably going to pull off of this node once I finish syncing stuff up
05:57 ^🔗	delta_sav	Hi
05:59 ^🔗	delta_sav	Is anyone running archives on 4chan other than ones submitted by channers?
06:00 ^🔗	Coderjoe	I was doing manual saves of threads on my own. I was considering writing something to automatically queue threads to be downloaded as well, but never got to it. and then I stopped visiting 4chan for the most part
06:01 ^🔗	Coderjoe	(and the auto-archiving scares me a bit due to CP posts)
06:01 ^🔗	delta_sav	shizer you're right
06:02 ^🔗	Coderjoe	captain picard can be rather troublesome
06:02 ^🔗	Coderjoe	I would rather not be vanned
06:02 ^🔗	delta_sav	could grab the images briefly for md5sum then store that
06:03 ^🔗	delta_sav	I could write something to get everything
06:03 ^🔗	delta_sav	actually, I could do all three
06:03 ^🔗	delta_sav	I should
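(Sketch, not from the channel: delta_sav's idea of keeping only a checksum of each image could look roughly like this. The function name is hypothetical; hashlib.md5 is the standard-library counterpart of the md5sum tool.)

```python
import hashlib


def image_fingerprint(data: bytes) -> str:
    """Return the hex MD5 digest of an image's bytes, as md5sum would print it."""
    return hashlib.md5(data).hexdigest()
```

Only the digest would be stored next to the thread text, so the image bytes can be discarded right after hashing.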
06:04 ^🔗	Coderjoe	I've already got something to download everything in a thread. it might need some tweaks, however.
06:04 ^🔗	delta_sav	you in the habit of sharing?
06:04 ^🔗	Coderjoe	I just needed to write another script that ran through the index pages and queue up new threads
06:05 ^🔗	delta_sav	it updates pretty quick but I could imagine something that wouldnt miss a thread
06:05 ^🔗	delta_sav	I think 4chan needs to be archived
06:06 ^🔗	Coderjoe	http://wegetsignal.org/raper.sh
06:06 ^🔗	delta_sav	couldnt archive the images as well though or the storage would be too much
06:06 ^🔗	delta_sav	best domain name ever
06:06 ^🔗	Coderjoe	it has a few things hard coded, like it likes to reside in ~4chan, looks at a textfile named raper.threads for urls to the thread pages, etc
06:07 ^🔗	Coderjoe	this downloads the thread page and the images and thumbnails
06:07 ^🔗	delta_sav	i'm looking at it
06:07 ^🔗	Coderjoe	I don't know if it ever worked on the flash board
06:08 ^🔗	Coderjoe	but it would also not delete images that got deleted on the server
06:08 ^🔗	Coderjoe	or even re-download them
06:08 ^🔗	delta_sav	curl not wget?
06:09 ^🔗	Coderjoe	I like wget
06:09 ^🔗	Coderjoe	and I make use of the -i parameter a lot
06:10 ^🔗	delta_sav	lol if this is an introduction I should say I've been doing data scraping for the last 6 months but just quit my job as a corporate whore :p
06:10 ^🔗	Coderjoe	the UA string at the top should give you a bit of an idea how long ago I wrote the script
06:10 ^🔗	delta_sav	my ua these days is "internet ready toaster oven"
06:10 ^🔗	Coderjoe	hell.. it even mentions the 4chan server named "img" which doesn't even exist anymore
06:11 ^🔗	Coderjoe	along with a workaround for img not returning 404 when a thread died
06:11 ^🔗	Coderjoe	er
06:11 ^🔗	Coderjoe	no img was, the others were not
06:12 ^🔗	delta_sav	not used to while(<>) in bash
06:12 ^🔗	delta_sav	errthing i take it?
06:13 ^🔗	bbot_	delta_sav: http://chanarchive.org/
06:13 ^🔗	bbot_	also http://archive.no-ip.org/
06:13 ^🔗	Coderjoe	i pretty much just ran this in a "while /bin/true; do blah; sleep; done" loop
06:13 ^🔗	delta_sav	no FAQ and more gives internal server error
06:13 ^🔗	bbot_	though neither of them redistribute archives, which is a shame
06:13 ^🔗	delta_sav	:{
06:14 ^🔗	delta_sav	4chan just may be the easiest way for anyone to say anything, which means it's prolly the most important thing to archive IMO
06:14 ^🔗	bbot_	maybe
06:15 ^🔗	Coderjoe	it might be better to rewrite in python or something, with a database for the thread queue
06:16 ^🔗	delta_sav	chanarchive.org looks solid but who are they?
06:16 ^🔗	delta_sav	how do you join/help
06:17 ^🔗	delta_sav	4chan is busy but not THAT busy, bash will do
06:17 ^🔗	delta_sav	bash -> mysql
06:17 ^🔗	delta_sav	to lamp for frontend for rest of internet personnel
06:19 ^🔗	Coderjoe	my bash script is already a big hack. adding a database does not seem like a good thing to do.
06:19 ^🔗	Coderjoe	(in bash)
06:20 ^🔗	delta_sav	eventually it'll get pretty big, I'm not sure thats even an "eventually"
06:20 ^🔗	Coderjoe	the python wasn't about speed, but stability and readability. I could add an HTMLParser that properly handled the img and a tags, for example. it would be a lot cleaner and less fragile than the perl blob in the middle of that bash file
06:21 ^🔗	Coderjoe	er
06:21 ^🔗	Coderjoe	s/readability/reliability/
06:21 ^🔗	Coderjoe	stupid brain
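(Sketch, not from the channel: the kind of HTMLParser Coderjoe describes, one that handles the img and a tags, might look like this with Python's standard html.parser module. The class and function names are hypothetical, and this is not his script.)

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def extract_links_and_images(html: str):
    """Parse one page and return (links, images) in document order."""
    parser = LinkAndImageParser()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images
```

Unlike a regex, the parser does not depend on attribute order or quoting style, which is the fragility he mentions.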
06:21 ^🔗	delta_sav	heheheheheheh, not readable for me I'm from the land of C
06:22 ^🔗	Coderjoe	if you're a decent programmer, it shouldn't be difficult to read stuff written in most languages
06:22 ^🔗	delta_sav	read no
06:22 ^🔗	delta_sav	write, it gets a lil tricky
06:23 ^🔗	Coderjoe	python makes it so much easier to whip up quick scripts to do complex things.
06:23 ^🔗	Coderjoe	you don't have to make them all OOP and everything if you don't want to
06:23 ^🔗	delta_sav	no
06:23 ^🔗	delta_sav	fuck OOP
06:24 ^🔗	Coderjoe	get out. :P
06:24 ^🔗	delta_sav	you've made some tasty bash
06:26 ^🔗	delta_sav	what's the best gide IUO
06:26 ^🔗	delta_sav	**guide
06:26 ^🔗	Coderjoe	guide to..?
06:26 ^🔗	delta_sav	advanced bash
06:27 ^🔗	Coderjoe	i dunno. I just figured it all out on my own with manpages and stuff
06:27 ^🔗	delta_sav	I see a whole shit-ton of caveats i didnt know so I'm curious
06:28 ^🔗	delta_sav	I do most of my quick-dev grunt work in bash... for said record
06:28 ^🔗	Coderjoe	i've been doing bash stuff for 17 years or so, though the most advanced bash stuff (arrays and stuff) i only started doing in the past 7 or so
06:29 ^🔗	Coderjoe	for me, it depends on what I need to do.
06:29 ^🔗	Coderjoe	I've done quick grunt stuff in bash, perl, python, and php
06:29 ^🔗	delta_sav	what do you use as a syntax ref?
06:29 ^🔗	Coderjoe	man pages and trial and error?
06:30 ^🔗	delta_sav	im bash/perl mostly, C for the fun stuff
06:30 ^🔗	Coderjoe	and also a few in C (my day job is mostly C++)
06:30 ^🔗	Coderjoe	C particularly if I don't need to do much string manipulation or things like database or the like
06:31 ^🔗	delta_sav	" if(/" I've never seen, what is?
06:32 ^🔗	Coderjoe	in the perl code? that's a regex match (the /string/ part is)
06:32 ^🔗	delta_sav	'//i' built in regex?
06:32 ^🔗	Coderjoe	that's perl code
06:32 ^🔗	delta_sav	nah its in bash
06:32 ^🔗	Coderjoe	no it isn't
06:32 ^🔗	delta_sav	if(/<a[^>]+href="([^"]+src[^"]+.jpg)"/i)
06:33 ^🔗	Coderjoe	look at the lines above that... IMAGE=`cat file | perl -e '
06:33 ^🔗	Coderjoe	it is a multiline bash script being passed as -e
06:35 ^🔗	Coderjoe	er, multiline PERL script
06:35 ^🔗	delta_sav	guess i dont get what while (<>) is
06:35 ^🔗	Coderjoe	again, perl
06:36 ^🔗	Coderjoe	loops through reading from standard input until end of file
06:36 ^🔗	delta_sav	thought thats _$
06:36 ^🔗	Coderjoe	into the variable $_
06:38 ^🔗	delta_sav	?
06:38 ^🔗	delta_sav	erm, so in bash a while (<>) immediately after the def loops through
06:38 ^🔗	delta_sav	err, immediately before
06:38 ^🔗	Coderjoe	no, that while line is part of a PERL script
06:39 ^🔗	delta_sav	oh shit its a backtick and a '
06:39 ^🔗	delta_sav	lol nm
06:39 ^🔗	delta_sav	I'm drunk, but do love archive team har
06:39 ^🔗	delta_sav	sorry
06:44 ^🔗	delta_sav	still dont get why no +~ tha
06:44 ^🔗	delta_sav	**tho
06:44 ^🔗	delta_sav	*****though
06:45 ^🔗	delta_sav	erm, =~
06:45 ^🔗	delta_sav	im sorry nm excuse me
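(Sketch, not from the channel: in Perl, while (<>) reads standard input one line at a time into $_, and a bare /regex/ is matched against $_ implicitly, which is why no =~ is needed. A rough Python rendering of that idiom, using the pattern quoted at 06:32, is below. The function name is hypothetical, and what the original script does with each match is not shown in the log.)

```python
import re

# Pattern as quoted in the log. The "." before "jpg" is unescaped there too,
# so it matches any character, not only a literal dot.
IMAGE_LINK = re.compile(r'<a[^>]+href="([^"]+src[^"]+.jpg)"', re.IGNORECASE)


def image_links(lines):
    """Return the first matching image URL of each input line, in order."""
    found = []
    for line in lines:  # plays the role of: while (<>)
        match = IMAGE_LINK.search(line)  # plays the role of: if (/.../i)
        if match:
            found.append(match.group(1))
    return found
```

Passing sys.stdin as `lines` reproduces the read-until-end-of-file behaviour Coderjoe describes.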
06:47 ^🔗	Coderjoe	another reason for rewriting it in python... it gets away from switching languages in the middle a few times.
08:34 ^🔗	Coderjoe	damn. 230gb behind already
15:38 ^🔗	Schbirid	if anyone wants to leech those emuwiki torrent files from me, tell me now, i will delete the directory tomorrow
16:27 ^🔗	Nemo_bis	splinder.com closing, do you know?
16:28 ^🔗	Nemo_bis	(they have about half a million blogs, I think, mostly or only in Italian)
16:34 ^🔗	alard	When?
16:35 ^🔗	Nemo_bis	24 November, apparently
16:35 ^🔗	Nemo_bis	it's something like 50 million pages, they say
16:35 ^🔗	Nemo_bis	I'm trying to understand where the date comes from
16:36 ^🔗	Nemo_bis	there's no official announcement yet AFAIK
16:39 ^🔗	Nemo_bis	delete spam -> http://archiveteam.org/index.php?title=Information
16:44 ^🔗	Nemo_bis	ah, found the source for the date
16:44 ^🔗	alard	Is there something on the wiki about splinder.com?
16:55 ^🔗	Nemo_bis	I've just created the page http://archiveteam.org/index.php?title=Splinder
16:58 ^🔗	alard	Good. I'm trying to download the list of users.
16:59 ^🔗	Nemo_bis	ok
17:00 ^🔗	alard	Then, if we're going to do this, we probably need to make a list of what users have.
17:00 ^🔗	Nemo_bis	do you need any help with the language?
17:02 ^🔗	alard	Well, the language I can manage, I can more or less decipher what it says. (And there's always the us version, right?)
17:02 ^🔗	alard	But making a list of things they have would be useful.
17:02 ^🔗	alard	Where do the 'ultimi commenti' come from?
17:14 ^🔗	alard	Nemo_bis: Are you editing the wiki at the moment? If not, I'll have a go.
17:14 ^🔗	Nemo_bis	alard, no, I'm not editing
17:15 ^🔗	alard	Okay.
17:15 ^🔗	Nemo_bis	hm, checking "ultimi commenti" (last comments)
17:15 ^🔗	alard	It's probably sourced from the blog and other places, I guess, not a separate source of data.
17:16 ^🔗	Nemo_bis	they're comments from all blogs
17:16 ^🔗	Nemo_bis	they're shown at the bottom of each blog post
17:16 ^🔗	Nemo_bis	but also separately as in http://www.splinder.com/myblog/comment/list/25742977
17:18 ^🔗	alard	Ah, ok.
17:19 ^🔗	alard	"I miei amici" => my friends, "Sono amico/a di" => friended by?
17:20 ^🔗	Nemo_bis	"I'm friend of"
17:20 ^🔗	Nemo_bis	but perhaps it's a status update? let me check
17:21 ^🔗	Nemo_bis	looks like a simple list, you mean http://www.splinder.com/profile/zoestyle/friendof ?
17:24 ^🔗	alard	Yes.
17:28 ^🔗	alard	What is missing? http://www.archiveteam.org/index.php?title=Splinder#Example_URLs
17:30 ^🔗	Nemo_bis	looking
17:31 ^🔗	alard	Comments are missing. I'd like to find examples (of comments on a media item, for example, preferably so many that there is pagination).
17:32 ^🔗	alard	Do you happen to have an account? Is there more information visible if you log in?
17:34 ^🔗	Nemo_bis	yes, I was going to ask about comments
17:34 ^🔗	Nemo_bis	no, I don't use splinder actually
17:35 ^🔗	Nemo_bis	all comments seem to be available in the same format as above, http://www.splinder.com/myblog/comment/list/<postID>
17:37 ^🔗	Nemo_bis	and for media it's e.g. http://www.splinder.com/media/comment/list/25744482
17:37 ^🔗	alard	Great. "Spiacente, non puoi commentare questo post!" probably means 'sorry, you can't/can no longer comment on this post'?
17:37 ^🔗	Nemo_bis	so it probably follows the same convention, with ?from=50 to see the next page etc.
17:37 ^🔗	Nemo_bis	yes
17:38 ^🔗	Nemo_bis	I've not found a way to increase the comments per page
17:38 ^🔗	alard	Do you happen to have found an example link with the comments pagination?
17:39 ^🔗	Nemo_bis	not yet
17:39 ^🔗	alard	Not even on the blog?
17:39 ^🔗	alard	(Where does the ?from=50 come from? Just a guess?)
17:41 ^🔗	Nemo_bis	no, clicking the next page link
17:41 ^🔗	Nemo_bis	found one: http://www.splinder.com/media/comment/list/21254470
17:41 ^🔗	Nemo_bis	(first google result here: http://ur1.ca/5qe9w )
17:42 ^🔗	alard	Wonderful. Not just a media item with comments, but a large one too.
17:44 ^🔗	Nemo_bis	I don't see a way to get the item url from the comments feed
17:45 ^🔗	Nemo_bis	but you're probably going to do it the other way round, I suppose
17:45 ^🔗	alard	No, I was just looking if I could find that. The comment system is the same, though, you can replace /media/ with /myblog/ and you still get the same comments.
17:45 ^🔗	Nemo_bis	ah
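(Sketch, not from the channel: the comment URLs discussed above can be generated from an item id. The /myblog/ and /media/ paths and the ?from= parameter come from the log; the function name and the page size of 50 are assumptions, the latter inferred only from the ?from=50 link.)

```python
def comment_list_urls(item_id, comment_count, section="myblog", page_size=50):
    """Return the comment-list URLs needed to cover `comment_count` comments.

    `section` is "myblog" or "media"; per the log both return the same comments.
    """
    base = f"http://www.splinder.com/{section}/comment/list/{item_id}"
    urls = [base]
    # Later pages are addressed by offset: ?from=50, ?from=100, ...
    for offset in range(page_size, comment_count, page_size):
        urls.append(f"{base}?from={offset}")
    return urls
```

Knowing the comment count in advance lets a grabber skip the extra pages, or the whole comment request, for items with few or no comments.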
17:46 ^🔗	alard	Any chance of finding a blog post with lots of comments?
17:46 ^🔗	Nemo_bis	this explains why they don't have two series of ids
17:46 ^🔗	Nemo_bis	isn't http://www.splinder.com/myblog/comment/list/25742977 ok?
17:47 ^🔗	Nemo_bis	http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
17:47 ^🔗	alard	I'd like to have a blog link. That's useful.
17:47 ^🔗	Nemo_bis	http://civati.splinder.com/post/25742977
17:47 ^🔗	Nemo_bis	(this is probably one of the main blogs here, this person is quite famous)
17:48 ^🔗	alard	That also tells us something about the url structure: with or without slugs at the end.
17:48 ^🔗	alard	Even more interesting: it shows that not every blog has the comments on the page.
17:48 ^🔗	Nemo_bis	yes :-/
17:49 ^🔗	alard	Is there a way to get an example of a media item with comments. (Not the comment page, but the media page that links there.)
17:50 ^🔗	alard	Oh, wait, never mind.
17:50 ^🔗	Nemo_bis	googled the comments? :)
17:50 ^🔗	alard	No, I just saw that the number of comments is listed on the media page. That's important information, since it saves a request to the comments page for most media items.
17:51 ^🔗	Nemo_bis	it was www.splinder.com/mediablog/danspo/media/21254470 anyway
17:51 ^🔗	alard	Although if you have an example, that's useful for testing.
17:51 ^🔗	alard	Ah, thanks.
17:52 ^🔗	alard	Now for someone with a lot of albums, to see the pagination there.
17:54 ^🔗	alard	Although I'm not sure that's interesting to download, since the album info is already listed with the items.
17:54 ^🔗	alard	What is 'condividi'?
17:56 ^🔗	Nemo_bis	"share"
17:57 ^🔗	Nemo_bis	I can't find any mediablog with lots of albums, still looking
17:59 ^🔗	alard	Well, leave it. It's not that important.
18:00 ^🔗	alard	What may be interesting is the video url.
18:00 ^🔗	alard	http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
18:00 ^🔗	alard	_small suggests that there is something larger.
18:03 ^🔗	alard	Ah, it seems it depends on the video.
18:03 ^🔗	alard	http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
18:03 ^🔗	alard	Doesn't have a small.
18:04 ^🔗	Nemo_bis	ah, found that the number of albums is limited for standard accounts, unlimited for pro
18:04 ^🔗	Nemo_bis	they downscale bigger videos?
18:06 ^🔗	alard	Not sure, haven't found a way to get anything other than _small.
18:06 ^🔗	alard	Unless there is no _small, but then other urls are different too. It depends on the video.
18:07 ^🔗	alard	Older videos are different.
18:08 ^🔗	alard	Ah, that is awkward. It's not just videos, also the images (newer ones have a predictable structure: id_square.jpg, id_medium.jpg etc.) Older ones have different ids for small, large etc.
18:09 ^🔗	*	Nemo_bis facepalms
18:11 ^🔗	alard	Is there any way to get a larger profile picture?
18:12 ^🔗	Nemo_bis	looking
18:17 ^🔗	alard	Probably not.
18:17 ^🔗	alard	I think the list is more or less complete: http://www.archiveteam.org/index.php?title=Splinder#Site_structure
18:17 ^🔗	Nemo_bis	can't find any in help pages, blogs etc.
18:20 ^🔗	alard	Ah, no, the audio.
18:26 ^🔗	alard	That's interesting, audio thumbnails: http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3 http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3
18:29 ^🔗	Nemo_bis	I've asked splinder people for comments...
18:30 ^🔗	alard	Not the people from splinder, I hope?
18:30 ^🔗	alard	:)
18:31 ^🔗	Nemo_bis	no :)
18:32 ^🔗	Nemo_bis	so the thumbnail is the same audio at 32 kb/s
18:34 ^🔗	alard	Yes, I think that's the difference. The duration is the same.
18:34 ^🔗	alard	Not all audio files have a thumbnail, by the way, older ones do not.
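(Sketch, not from the channel: the newer file URLs quoted above share one id across variants such as _small and _thumbnail, so a filename can be split into id, variant and extension. The function name is hypothetical, and the 32-hex-character id is an assumption based only on the three examples in the log. Older files use a different id per size and cannot be related this way.)

```python
import re
from urllib.parse import urlparse

# <32 hex chars>[_variant].<extension>
FILE_NAME = re.compile(
    r"^(?P<id>[0-9a-f]{32})(?:_(?P<variant>[a-z]+))?\.(?P<ext>[a-z0-9]+)$"
)


def split_media_url(url):
    """Return (id, variant, ext) for a file URL, or None if it does not fit.

    `variant` is None when the URL carries no _suffix.
    """
    name = urlparse(url).path.rsplit("/", 1)[-1]
    match = FILE_NAME.match(name)
    if match is None:
        return None
    return match.group("id"), match.group("variant"), match.group("ext")
```

Grouping downloaded URLs by the returned id shows which variants exist for an item without guessing at ones the site never linked.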
18:35 ^🔗	alard	What's the point of http://bloggando.splinder.com/ ? Is that just a normal blog?
18:36 ^🔗	alard	I guess it is, there's even a profile named 'bloggando', maybe something special by the company.
18:36 ^🔗	Nemo_bis	it's a manual selection of posts by them
18:37 ^🔗	Nemo_bis	they've also published some books
18:44 ^🔗	alard	This comment http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358
18:44 ^🔗	alard	'il settore blog', would that be just the blogs, or the complete user content on the site?
18:49 ^🔗	alard	Later more.
18:53 ^🔗	Nemo_bis	it means the "blog division" of the company; splinder is a subset of it
20:00 ^🔗	bsmith094	im trying to wget a webcomic site, how do i get around 403 forbidden? tried ignoring robots, didn't work, any suggestions?
20:00 ^🔗	Nemo_bis	change user agent?
20:10 ^🔗	bsmith094	ok thanks, i just finally figured out the proper syntax for that, and apparently the site is only blocking googlebot from some python scripts
20:11 ^🔗	bsmith094	not the files so, woo, that worked! wget -U rocks
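(Sketch, not from the channel: the same change of user agent that wget -U performs can be made from Python's standard library. The function name, URL and user-agent string in the usage note are placeholders, not the site or string bsmith094 used.)

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Prepare a GET request that sends `user_agent` instead of the default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

A request built with, say, build_request("http://example.com/comic/1", "Mozilla/5.0 (compatible; example)") would then be passed to urllib.request.urlopen to fetch the page.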
23:24 ^🔗	alard	Who can help with an experiment?
23:25 ^🔗	alard	Experiment is as follows: please git pull from https://github.com/ArchiveTeam/splinder-grab, get wget-warc, then see if you can download a profile from www.splinder.com
23:37 ^🔗	underscor	Example profile name?
23:42 ^🔗	PepsiMax	hey guys, 11/11/11
23:45 ^🔗	alard	underscor: lowvoice
