[00:41] http://archiveofourown.org/works/258626
[00:57] yes, yes.
[00:58] chronomex: No way, you're beating me!
[00:58] :'(
[02:08] Coderjoe flatlined
[02:24] underscor: how the hell do I beat you
[02:31] chronomex: download faster
[04:03] hm, that's annoying: you can plug in anything for the username in http://developer.berlios.de/devlog/username, and you'll get back 200 OK
[04:33] uh oh, this wget-warc is using all my memory :(
[04:47] I get the feeling that BerliOS developer logs are pretty sparse
[04:47] I've checked 1,058 users so far and found 2 real devlogs
[04:54] curses, Paradoks is beating me on mobileme by a gigabyte!
[04:54] BUT NOT FOR LONG
[04:57] yeah. right.
[05:04] buh
[05:48] underscor: yeah... running out of disk space will do that
[05:57] I'm probably going to pull off of this node once I finish syncing stuff up
[05:57] Hi
[05:59] Is anyone running archives on 4chan other than ones submitted by channers?
[06:00] I was doing manual saves of threads on my own. I was considering writing something to automatically queue threads to be downloaded as well, but never got to it. and then I stopped visiting 4chan for the most part
[06:01] (and the auto-archiving scares me a bit due to CP posts)
[06:01] shizer you're right
[06:02] captain picard can be rather troublesome
[06:02] I would rather not be vanned
[06:02] could grab the images briefly for md5sum then store that
[06:03] I could write something to get everything
[06:03] actually, I could do all three
[06:03] I should
[06:04] I've already got something to download everything in a thread. it might need some tweaks, however.
[06:04] you in the habit of sharing?
[06:04] I just needed to write another script that ran through the index pages and queued up new threads
[06:05] it updates pretty quick but I could imagine something that wouldnt miss a thread
[06:05] I think 4chan needs to be archived
[06:06] http://wegetsignal.org/raper.sh
[06:06] couldnt archive the images as well though or the storage would be too much
[06:06] best domain name ever
[06:06] it has a few things hard coded, like it likes to reside in ~4chan, looks at a textfile named raper.threads for urls to the thread pages, etc
[06:07] this downloads the thread page and the images and thumbnails
[06:07] i'm looking at it
[06:07] I don't know if it ever worked on the flash board
[06:08] but it would also not delete images that got deleted on the server
[06:08] or even re-download them
[06:08] curl not wget?
[06:09] I like wget
[06:09] and I make use of the -i parameter a lot
[06:10] lol if this is an introduction I should say I've been doing data scraping for the last 6 months but just quit my job as a corporate whore :p
[06:10] the UA string at the top should give you a bit of an idea how long ago I wrote the script
[06:10] my ua these days is "internet ready toaster oven"
[06:10] hell.. it even mentions the 4chan server named "img" which doesn't even exist anymore
[06:11] along with a workaround for img not returning 404 when a thread died
[06:11] er
[06:11] no, img was, the others were not
[06:12] not used to while(<>) in bash
[06:12] errthing i take it?
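(A rough sketch of the kind of thread grabber described above, not the actual raper.sh: the raper.threads filename and the use of wget -i come from the discussion, while the directory layout and the image-extraction pattern are made up for illustration.)

```bash
#!/bin/bash
# Minimal sketch of a per-thread grabber: read thread URLs from raper.threads,
# save each thread page, extract image/thumbnail URLs, and fetch them via -i.
# The extraction regex below is a crude stand-in, not the real script's parser.
cd ~/4chan || exit 1

while read -r THREAD; do
    DIR=$(echo "$THREAD" | sed 's|https\?://||; s|/|_|g')
    mkdir -p "$DIR"

    # Grab the thread page itself
    wget -q -O "$DIR/thread.html" "$THREAD"

    # Pull out candidate image URLs (illustrative pattern only)
    grep -oE 'https?://[^"]+\.(jpg|png|gif)' "$DIR/thread.html" | sort -u > "$DIR/images.txt"

    # Let wget read the URL list, as with the -i parameter mentioned above
    wget -q -nc -P "$DIR" -i "$DIR/images.txt"
done < raper.threads
```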
[06:13] delta_sav: http://chanarchive.org/
[06:13] also http://archive.no-ip.org/
[06:13] i pretty much just ran this in a "while /bin/true; do blah; sleep; done" loop
[06:13] no FAQ and more gives internal server error
[06:13] though neither of them redistribute archives, which is a shame
[06:13] :{
[06:14] 4chan just may be the easiest way for anyone to say anything, which means it's prolly the most important thing to archive IMO
[06:14] maybe
[06:15] it might be better to rewrite in python or something, with a database for the thread queue
[06:16] chanarchive.org looks solid but who are they?
[06:16] how do you join/help
[06:17] 4chan is busy but not THAT busy, bash will do
[06:17] bash -> mysql
[06:17] to lamp for frontend for rest of internet personnel
[06:19] my bash script is already a big hack. adding a database does not seem like a good thing to do.
[06:19] (in bash)
[06:20] eventually it'll get pretty big, I'm not sure thats even an "eventually"
[06:20] the python wasn't about speed, but stability and readability. I could add an HTMLParser that properly handled the img and a tags, for example. it would be a lot cleaner and less fragile than the perl blob in the middle of that bash file
[06:21] er
[06:21] s/readability/reliability/
[06:21] stupid brain
[06:21] heheheheheheh, not readable for me I'm from the land of C
[06:22] if you're a decent programmer, it shouldn't be difficult to read stuff written in most languages
[06:22] read no
[06:22] write, it gets a lil tricky
[06:23] python makes it so much easier to whip up quick scripts to do complex things.
[06:23] you don't have to make them all OOP and everything if you don't want to
[06:23] no
[06:23] fuck OOP
[06:24] get out. :P
[06:24] you've made some tasty bash
[06:26] what's the best gide IUO
[06:26] **guide
[06:26] guide to..?
[06:26] advanced bash
[06:27] i dunno. I just figured it all out on my own with manpages and stuff
[06:27] I see a whole shit-ton of caveats i didnt know so I'm curious
[06:28] I do most of my quick-dev grunt work in bash... for said record
[06:28] i've been doing bash stuff for 17 years or so, though the most advanced bash stuff (arrays and stuff) i only started doing in the past 7 or so
[06:29] for me, it depends on what I need to do.
[06:29] I've done quick grunt stuff in bash, perl, python, and php
[06:29] what do you use as a syntax ref?
[06:29] man pages and trial and error?
[06:30] im bash/perl mostly, C for the fun stuff
[06:30] and also a few in C (my day job is mostly C++)
[06:30] C particularly if I don't need to do much string manipulation or things like database or the like
[06:31] " if(/" I've never seen, what is it?
[06:32] in the perl code? that's a regex match (the /string/ part is)
[06:32] '//i' built in regex?
[06:32] that's perl code
[06:32] nah its in bash
[06:32] no it isn't
[06:32] if(/]+href="([^"]+src[^"]+.jpg)"/i)
[06:33] look at the lines above that... IMAGE=`cat file | perl -e '
[06:33] it is a multiline bash script being passed as -e
[06:35] er, multiline PERL script
[06:35] guess i dont get what while (<>) is
[06:35] again, perl
[06:36] loops through reading from standard input until end of file
[06:36] thought thats _$
[06:36] into the variable $_
[06:38] ?
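(For anyone else puzzled by the construct being discussed: a toy version of the pattern, a multi-line perl script passed to -e inside a bash command substitution, where while(<>) reads standard input line by line into $_ and if(/.../i) is a case-insensitive match against $_. The file and variable names here are invented for illustration; this is not the real raper.sh code.)

```bash
#!/bin/bash
# Toy version of the bash-wrapping-perl pattern discussed above.
IMAGE=$(cat thread.html | perl -e '
while (<>) {                          # read stdin line by line into $_
    if (/href="([^"]+\.jpg)"/i) {     # case-insensitive match against $_
        print "$1\n";                 # $1 is the captured URL
    }
}')
echo "$IMAGE"
```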
[06:38] erm, so in bash a while (<>) immediately after the def loops through
[06:38] err, immediately before
[06:38] no, that while line is part of a PERL script
[06:39] oh shit its a backtick and a '
[06:39] lol nm
[06:39] I'm drunk, but do love archive team har
[06:39] sorry
[06:44] still dont get why no +~ tha
[06:44] **tho
[06:44] *****though
[06:45] erm, =~
[06:45] im sorry nm excuse me
[06:47] another reason for rewriting it in python... it gets away from switching languages in the middle a few times.
[08:34] damn. 230gb behind already
[15:38] if anyone wants to leech those emuwiki torrent files from me tell me now, i will delete the directory tomorrow
[16:27] splinder.com is closing, do you know?
[16:28] (they have about half a million blogs, I think, mostly or only in Italian)
[16:34] When?
[16:35] 24 November, apparently
[16:35] it's something like 50 million pages, they say
[16:35] I'm trying to understand where the date comes from
[16:36] there's no official announcement yet AFAIK
[16:39] delete spam -> http://archiveteam.org/index.php?title=Information
[16:44] ah, found the source for the date
[16:44] Is there something on the wiki about splinder.com?
[16:55] I've just created the page http://archiveteam.org/index.php?title=Splinder
[16:58] Good. I'm trying to download the list of users.
[16:59] ok
[17:00] Then, if we're going to do this, we probably need to make a list of what users have.
[17:00] do you need any help with the language?
[17:02] Well, the language I can manage, I can more or less decipher what it says. (And there's always the us version, right?)
[17:02] But making a list of things they have would be useful.
[17:02] Where do the 'ultimi commenti' come from?
[17:14] Nemo_bis: Are you editing the wiki at the moment? If not, I'll have a go.
[17:14] alard, no, I'm not editing
[17:15] Okay.
[17:15] hm, checking "ultimi commenti" (last comments)
[17:15] It's probably sourced from the blog and other places, I guess, not a separate source of data.
[17:16] they're comments from all blogs
[17:16] they're shown at the bottom of each blog post
[17:16] but also separately as in http://www.splinder.com/myblog/comment/list/25742977
[17:18] Ah, ok.
[17:19] "I miei amici" => my friends, "Sono amico/a di" => friended by?
[17:20] "I'm friend of"
[17:20] but perhaps it's a status update? let me check
[17:21] looks like a simple list, you mean http://www.splinder.com/profile/zoestyle/friendof ?
[17:24] Yes.
[17:28] What is missing? http://www.archiveteam.org/index.php?title=Splinder#Example_URLs
[17:30] looking
[17:31] Comments are missing. I'd like to find examples (of comments on a media item, for example, preferably so many that there is pagination).
[17:32] Do you happen to have an account? Is there more information visible if you log in?
[17:34] yes, I was going to ask about comments
[17:34] no, I don't use splinder actually
[17:35] all comments seem to be available in the same format as above, http://www.splinder.com/myblog/comment/list/
[17:37] and for media it's e.g. http://www.splinder.com/media/comment/list/25744482
[17:37] Great. "Spiacente, non puoi commentare questo post!" probably means 'sorry, you can't/can no longer comment on this post'?
[17:37] so it probably follows the same convention, with ?from=50 to see the next page etc.
[17:37] yes
[17:38] I've not found a way to increase the comments per page
[17:38] Do you happen to have found an example link with the comments pagination?
[17:39] not yet
[17:39] Not even on the blog?
[17:39] (Where does the ?from=50 come from? Just a guess?)
[17:41] no, clicking the next page link
[17:41] found one: http://www.splinder.com/media/comment/list/21254470
[17:41] (first google result here: http://ur1.ca/5qe9w )
[17:42] Wonderful. Not just a media item with comments, but a large one too.
[17:44] I don't see a way to get the item url from the comments feed
[17:45] but you're probably going to do it the other way round, I suppose
[17:45] No, I was just looking if I could find that. The comment system is the same, though, you can replace /media/ with /myblog/ and you still get the same comments.
[17:45] ah
[17:46] Any chance of finding a blog post with lots of comments?
[17:46] this explains why they don't have two series of ids
[17:46] isn't http://www.splinder.com/myblog/comment/list/25742977 ok?
[17:47] http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
[17:47] I'd like to have a blog link. That's useful.
[17:47] http://civati.splinder.com/post/25742977
[17:47] (this is probably one of the main blogs here, this person is quite famous)
[17:48] That also tells us something about the url structure: with or without slugs at the end.
[17:48] Even more interesting: it shows that not every blog has the comments on the page.
[17:48] yes :-/
[17:49] Is there a way to get an example of a media item with comments? (Not the comment page, but the media page that links there.)
[17:50] Oh, wait, never mind.
[17:50] googled the comments? :)
[17:50] No, I just saw that the number of comments is listed on the media page. That's important information, since it saves a request to the comments page for most media items.
[17:51] it was www.splinder.com/mediablog/danspo/media/21254470 anyway
[17:51] Although if you have an example, that's useful for testing.
[17:51] Ah, thanks.
[17:52] Now for someone with a lot of albums, to see the pagination there.
[17:54] Although I'm not sure that's interesting to download, since the album info is already listed with the items.
[17:54] What is 'condividi'?
[17:56] "share"
[17:57] I can't find any mediablog with lots of albums, still looking
[17:59] Well, leave it. It's not that important.
[18:00] What may be interesting is the video url.
[18:00] http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
[18:00] _small suggests that there is something larger.
[18:03] Ah, it seems it depends on the video.
[18:03] http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
[18:03] Doesn't have a small.
[18:04] ah, found that the number of albums is limited for standard accounts, unlimited for pro
[18:04] they downscale bigger videos?
[18:06] Not sure, haven't found a way to get anything other than _small.
[18:06] Unless there is no _small, but then other urls are different too. It depends on the video.
[18:07] Older videos are different.
[18:08] Ah, that is awkward. It's not just videos, also the images (newer ones have a predictable structure: id_square.jpg, id_medium.jpg etc.). Older ones have different ids for small, large etc.
[18:09] * Nemo_bis facepalms
[18:11] Is there any way to get a larger profile picture?
[18:12] looking
[18:17] Probably not.
[18:17] I think the list is more or less complete: http://www.archiveteam.org/index.php?title=Splinder#Site_structure
[18:17] can't find any in help pages, blogs etc.
[18:20] Ah, no, the audio.
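(A sketch of how the ?from= pagination discussed above could be walked with wget; the step of 50 comes from the next-page link mentioned in the conversation, while the "cid-" stop marker and the assumption that an empty page ends the list are guesses, not verified behaviour of the site.)

```bash
#!/bin/bash
# Sketch: fetch every page of a Splinder comment list by stepping ?from=
# in increments of 50 until a page no longer contains comment anchors.
COMMENT_LIST="http://www.splinder.com/myblog/comment/list/25742977"
FROM=0

while true; do
    PAGE="comments_${FROM}.html"
    wget -q -O "$PAGE" "${COMMENT_LIST}?from=${FROM}"

    # Stop when the page no longer contains comment anchors (assumed marker,
    # based on the #cid- fragment seen in comment permalinks)
    if ! grep -q 'cid-' "$PAGE"; then
        rm -f "$PAGE"
        break
    fi
    FROM=$((FROM + 50))
done
```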
[18:26] That's interesting, audio thumbnails: http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3 http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3
[18:29] I've asked splinder people about comments...
[18:30] Not the people *from* splinder, I hope?
[18:30] :)
[18:31] no :)
[18:32] so the thumbnail is the same audio at 32 kb/s
[18:34] Yes, I think that's the difference. The duration is the same.
[18:34] Not all audio files have a thumbnail, by the way, older ones do not.
[18:35] What's the point of http://bloggando.splinder.com/ ? Is that just a normal blog?
[18:36] I guess it is, there's even a profile named 'bloggando', maybe something special by the company.
[18:36] it's a manual selection of posts by them
[18:37] they've also published some books
[18:44] This comment http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358
[18:44] 'il settore blog', would that be just the blogs, or the complete user content on the site?
[18:49] More later.
[18:53] it means the "blog division" of the company; splinder is a subset of it
[20:00] im trying to wget a webcomic site, how do i get around 403 forbidden? tried ignoring robots, didn't work, any suggestions?
[20:00] change user agent?
[20:10] ok thanks, i just finally figured out the proper syntax for that, and apparently the site is only blocking googlebot from some python scripts
[20:11] not the files, so, woo, that worked! wget -U rocks
[23:24] Who can help with an experiment?
[23:25] Experiment is as follows: please git pull from https://github.com/ArchiveTeam/splinder-grab, get wget-warc, then see if you can download a profile from www.splinder.com
[23:37] Example profile name?
[23:42] hey guys, 11/11/11
[23:45] underscor: lowvoice
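(For anyone trying the experiment: roughly the steps involved, as a sketch. The repository URL and the example profile "lowvoice" come from the channel; everything else is assumed. In particular, the exact download command depends on the scripts inside splinder-grab, so the direct wget-warc call below is only a stand-in showing the general idea of writing a WARC with --warc-file.)

```bash
#!/bin/bash
# Rough outline of the experiment above, not the repository's actual procedure.
git clone https://github.com/ArchiveTeam/splinder-grab.git
cd splinder-grab || exit 1

# Assumes a WARC-capable wget build is on the PATH as "wget-warc"
PROFILE="lowvoice"   # example profile name mentioned in the channel
wget-warc --warc-file="splinder-${PROFILE}" \
    --page-requisites --no-verbose \
    "http://www.splinder.com/profile/${PROFILE}"
```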