#archiveteam 2011-11-10,Thu

Time Nickname Message
00:41 🔗 underscor http://archiveofourown.org/works/258626
00:57 🔗 chronomex yes, yes.
00:58 🔗 underscor chronomex: No way, you're beating me!
00:58 🔗 underscor :'(
02:08 🔗 underscor Coderjoe flatlined
02:24 🔗 chronomex underscor: how the hell do I beat you
02:31 🔗 underscor chronomex: download faster
04:03 🔗 yipdw hm, that's annoying: you can plug in anything for the username in http://developer.berlios.de/devlog/username, and you'll get back 200 OK
04:33 🔗 underscor uh oh, this wget-warc is using all my memory :(
04:47 🔗 yipdw I get the feeling that BerliOS developer logs are pretty sparse
04:47 🔗 yipdw I've checked 1,058 users so far and found 2 real devlogs
04:54 🔗 yipdw curses, Paradoks is beating me on mobileme by a gigabyte!
04:54 🔗 yipdw BUT NOT FOR LONG
04:57 🔗 chronomex yeah. right.
05:04 🔗 rude___ buh
05:48 🔗 Coderjoe underscor: yeah... running out of disk space will do that
05:57 🔗 Coderjoe I'm probably going to pull off of this node once I finish syncing stuff up
05:57 🔗 delta_sav Hi
05:59 🔗 delta_sav Is anyone running archives on 4chan other than ones submitted by channers?
06:00 🔗 Coderjoe I was doing manual saves of threads on my own. I was considering writing something to automatically queue threads to be downloaded as well, but never got to it. and then I stopped visiting 4chan for the most part
06:01 🔗 Coderjoe (and the auto-archiving scares me a bit due to CP posts)
06:01 🔗 delta_sav shizer you're right
06:02 🔗 Coderjoe captain picard can be rather troublesome
06:02 🔗 Coderjoe I would rather not be vanned
06:02 🔗 delta_sav could grab the images briefly for md5sum then store thaht
06:03 🔗 delta_sav I could write something to get everything
06:03 🔗 delta_sav actually, I could do all three
06:03 🔗 delta_sav I should
06:04 🔗 Coderjoe I've already got something to download everything in a thread. it might need some tweaks, however.
06:04 🔗 delta_sav you in the habbit of sharing?
06:04 🔗 Coderjoe I just needed to write another script that ran through the index pages and queue up new threads
06:05 🔗 delta_sav it updates pretty quick but I could imagine something that wouldnt miss a therad
06:05 🔗 delta_sav I think 4chan needs to be archived
06:06 🔗 Coderjoe http://wegetsignal.org/raper.sh
06:06 🔗 delta_sav couldnt archive the images as well though or the storage would be too much
06:06 🔗 delta_sav best domain name ever
06:06 🔗 Coderjoe it has a few things hard coded, like it likes to reside in ~4chan, looks at a textfile named raper.threads for urls to the thread pages, etc
06:07 🔗 Coderjoe this downloads the thread page and the images and thumbnails
06:07 🔗 delta_sav i'm looking at it
06:07 🔗 Coderjoe I don't know if it ever worked on the flash board
06:08 🔗 Coderjoe but it would also not delete images that got deleted on the server
06:08 🔗 Coderjoe or even re-download them
06:08 🔗 delta_sav curl not wegt?
06:09 🔗 Coderjoe I like wget
06:09 🔗 Coderjoe and I make use of the -i parameter a lot
06:10 🔗 delta_sav lol if this is an introduction I should say I've been doing data scraping for the last 6 months but just quit my job as a corparate whore :p
06:10 🔗 Coderjoe the UA string at the top should give you a bit of an idea how long ago I wrote the script
06:10 🔗 delta_sav my ua these days is "internet ready toaster oven"
06:10 🔗 Coderjoe hell.. it even mentions the 4chan server named "img" which doesn't even exist anymore
06:11 🔗 Coderjoe along with a workaround for img not returning 404 when a thread died
06:11 🔗 Coderjoe er
06:11 🔗 Coderjoe no img was, the others were not
06:12 🔗 delta_sav not use to while(<>) in bash
06:12 🔗 delta_sav errthing i take it?
06:13 🔗 bbot_ delta_sav: http://chanarchive.org/
06:13 🔗 bbot_ also http://archive.no-ip.org/
06:13 🔗 Coderjoe i pretty much just ran this in a "while /bin/true; do blah; sleep; done" loop
06:13 🔗 delta_sav no FAQ and more gives internal server error
06:13 🔗 bbot_ though neither of them redistribute archives, which is a shame
06:13 🔗 delta_sav :{
06:14 🔗 delta_sav 4chan just may be the easiest way for anyone to say anything, which means it's prolly the most important thing to archive IMO
06:14 🔗 bbot_ maybe
06:15 🔗 Coderjoe it might be better to rewrite in python or something, with a database for the thread queue
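A minimal sketch of the "database for the thread queue" idea, assuming SQLite as the store. The table layout and the function names (`open_queue`, `enqueue`, `pending`, `mark_done`) are hypothetical and do not come from the script discussed in the log.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) a small queue of thread URLs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        " url TEXT PRIMARY KEY,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, url):
    """Add a thread URL; return True only if it was not queued before."""
    cur = db.execute("INSERT OR IGNORE INTO threads (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1


def pending(db):
    """Return URLs that still need downloading, oldest first."""
    rows = db.execute("SELECT url FROM threads WHERE done = 0 ORDER BY rowid")
    return [row[0] for row in rows]


def mark_done(db, url):
    """Flag a thread as finished so it is not fetched again."""
    db.execute("UPDATE threads SET done = 1 WHERE url = ?", (url,))
    db.commit()
```

An index-page scanner could call `enqueue` for every thread link it sees, while a separate download loop works through `pending`. The primary key keeps a thread from being queued twice.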
06:16 🔗 delta_sav chanarchive.org looks solid but who are they?
06:16 🔗 delta_sav how do you join/help
06:17 🔗 delta_sav 4chan is busy but not THAT busy, bash will do
06:17 🔗 delta_sav bash -> mysql
06:17 🔗 delta_sav to lamp for frontend for rest of internet personell
06:19 🔗 Coderjoe my bash script is already a big hack. adding a database does not seem like a good thing to do.
06:19 🔗 Coderjoe (in bash)
06:20 🔗 delta_sav eventually it'll get pretty big, I'm not sure thats even an "eventually"
06:20 🔗 Coderjoe the python wasn't about speed, but stability and readability. I could add an HTMLParser that properly handled the img and a tags, for example. it would be a lot cleaner and less fragile than the perl blob in the middle of that bash file
06:21 🔗 Coderjoe er
06:21 🔗 Coderjoe s/readability/reliability/
06:21 🔗 Coderjoe stupid brain
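A rough sketch of what an HTMLParser that handles the img and a tags might look like, using Python's standard `html.parser` module. The class name, the `extract_images` helper, and the rule for what counts as a full-size image link (an href containing "src" and ending in ".jpg", loosely mirroring the regex quoted later in the log) are assumptions for illustration only.

```python
from html.parser import HTMLParser


class ThreadLinkParser(HTMLParser):
    """Collects image links from <a href> and <img src> attributes."""

    def __init__(self):
        super().__init__()
        self.full_images = []  # hrefs of <a> tags that look like full-size images
        self.thumbnails = []   # src values of every <img> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            # Assumed rule: full-size images live under a path containing "src".
            if "src" in href and href.lower().endswith(".jpg"):
                self.full_images.append(href)
        elif tag == "img":
            src = attrs.get("src")
            if src:
                self.thumbnails.append(src)


def extract_images(html):
    """Return (full_images, thumbnails) found in an HTML string."""
    parser = ThreadLinkParser()
    parser.feed(html)
    parser.close()
    return parser.full_images, parser.thumbnails
```

Because the parser works on tags and attributes instead of raw text, attribute order and extra whitespace inside a tag should not matter, which is where a single-line regex tends to be fragile.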
06:21 🔗 delta_sav heheheheheheh, not readable for me I'm from the land of C
06:22 🔗 Coderjoe if you're a decent programmer, it shouldn't be difficult to read stuff written in most languages
06:22 🔗 delta_sav read no
06:22 🔗 delta_sav write, it gets a lil tricky
06:23 🔗 Coderjoe python makes it so much easier to whip up quick scripts to do complex things.
06:23 🔗 Coderjoe you don't have to make them all OOP and everything if you don't want to
06:23 🔗 delta_sav no
06:23 🔗 delta_sav fuck OOP
06:24 🔗 Coderjoe get out. :P
06:24 🔗 delta_sav you've made some tasty bash
06:26 🔗 delta_sav what's the best gide IUO
06:26 🔗 delta_sav **guide
06:26 🔗 Coderjoe guide to..?
06:26 🔗 delta_sav advanced bash
06:27 🔗 Coderjoe i dunno. I just figured it all out on my own with manpages and stuff
06:27 🔗 delta_sav I see a whole shit-ton of caveats i didnt know so I'm curious
06:28 🔗 delta_sav I do most of my quick-dev grunt work in bash... for said record
06:28 🔗 Coderjoe i've been doing bash stuff for 17 years or so, though the most advanced bash stuff (arrays and stuff) i only started doing in the past 7 or so
06:29 🔗 Coderjoe for me, it depends on what I need to do.
06:29 🔗 Coderjoe I've done quick grunt stuff in bash, perl, python, and php
06:29 🔗 delta_sav what do you use as a syntax ref?
06:29 🔗 Coderjoe man pages and trial and error?
06:30 🔗 delta_sav im bash/perl mostly, C for the fun stuff
06:30 🔗 Coderjoe and also a few in C (my day job is mostly C++)
06:30 🔗 Coderjoe C particularly if I don't need to do much string manipulation or things like database or the like
06:31 🔗 delta_sav " if(/" I've never seen, what is?
06:32 🔗 Coderjoe in the perl code? that's a regex match (the /string/ part is)
06:32 🔗 delta_sav '//i' built in regex?
06:32 🔗 Coderjoe that's perl code
06:32 🔗 delta_sav nah its in bash
06:32 🔗 Coderjoe no it isn't
06:32 🔗 delta_sav if(/<a[^>]+href="([^"]+src[^"]+.jpg)"/i)
06:33 🔗 Coderjoe look at the lines above that... IMAGE=`cat file | perl -e '
06:33 🔗 Coderjoe it is a multiline bash script being passed as -e
06:35 🔗 Coderjoe er, multiline PERL script
06:35 🔗 delta_sav guess i dont get what while (<>) is
06:35 🔗 Coderjoe again, perl
06:36 🔗 Coderjoe loops through reading from standard input until end of file
06:36 🔗 delta_sav thought thats _$
06:36 🔗 Coderjoe into the variable $_
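For readers more at home outside Perl, here is a loose Python equivalent of the `while (<>)` loop and the `if (/.../i)` match quoted above. It is a sketch, not the code from the script: the function name `find_image_links` is made up, and the dot before "jpg" is escaped here, which the quoted pattern does not do.

```python
import re

# Pattern taken from the line quoted in the log, with "." escaped before "jpg".
LINK_RE = re.compile(r'<a[^>]+href="([^"]+src[^"]+\.jpg)"', re.IGNORECASE)


def find_image_links(lines):
    """Yield the captured href from every line that matches the pattern."""
    for line in lines:                # Perl: while (<>) { ... }, line lands in $_
        match = LINK_RE.search(line)  # Perl: if (/.../i), matched against $_
        if match:
            yield match.group(1)      # Perl: $1
```

Passing `sys.stdin` as `lines` gives the same read-until-end-of-file behaviour that `while (<>)` has when the Perl script is fed from a pipe.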
06:38 🔗 delta_sav ?
06:38 🔗 delta_sav erm, so in bash a while (<>) immediatly after the def loops throughL
06:38 🔗 delta_sav err, immediatly before
06:38 🔗 Coderjoe no, that while line is part of a PERL script
06:39 🔗 delta_sav oh shit its a backtick and a '
06:39 🔗 delta_sav lol nm
06:39 🔗 delta_sav I'm drunk, but do love archive team har
06:39 🔗 delta_sav sorry
06:44 🔗 delta_sav still dont get why no +~ tha
06:44 🔗 delta_sav **tho
06:44 🔗 delta_sav *****though
06:45 🔗 delta_sav erm, =~
06:45 🔗 delta_sav im sorry nm excuse me
06:47 🔗 Coderjoe another reason for rewriting it in python... it gets away from switching languages in the middle a few times.
08:34 🔗 Coderjoe damn. 230gb behind already
15:38 🔗 Schbirid if anyone wants to leech that emuwiki torrent files from me tell me now, i will delete the directory tomorrow
16:27 🔗 Nemo_bis splinder.com closing, do you know?
16:28 🔗 Nemo_bis (they have about half a million blogs, I think, mostly or only in Italian)
16:34 🔗 alard When?
16:35 🔗 Nemo_bis 24 November, apparently
16:35 🔗 Nemo_bis it's something like 50 million pages, they say
16:35 🔗 Nemo_bis I'm trying to understand where the date comes from
16:36 🔗 Nemo_bis there's no official announcement yet AFAIK
16:39 🔗 Nemo_bis delete spam -> http://archiveteam.org/index.php?title=Information
16:44 🔗 Nemo_bis ah, found the source for the date
16:44 🔗 alard Is there something on the wiki about splinder.com?
16:55 🔗 Nemo_bis I've just created the page http://archiveteam.org/index.php?title=Splinder
16:58 🔗 alard Good. I'm trying to download the list of users.
16:59 🔗 Nemo_bis ok
17:00 🔗 alard Then, if we're going to do this, we probably need to make a list of what users have.
17:00 🔗 Nemo_bis do you need any help with the language?
17:02 🔗 alard Well, the language I can manage, I can more or less decipher what it says. (And there's always the us version, right?)
17:02 🔗 alard But making a list of things they have would be useful.
17:02 🔗 alard Where do the 'ultimi commenti' come from?
17:14 🔗 alard Nemo_bis: Are you editing the wiki at the moment? If not, I'll have a go.
17:14 🔗 Nemo_bis alard, no, I'm not editing
17:15 🔗 alard Okay.
17:15 🔗 Nemo_bis hm, checking "ultimi commenti" (last comments)
17:15 🔗 alard It's probably sourced from the blog and other places, I guess, not a separate source of data.
17:16 🔗 Nemo_bis they're comments from all blogs
17:16 🔗 Nemo_bis they're shown at the bottom of each blog post
17:16 🔗 Nemo_bis but also separately as in http://www.splinder.com/myblog/comment/list/25742977
17:18 🔗 alard Ah, ok.
17:19 🔗 alard "I miei amici" => my friends, "Sono amico/a di" => friended by?
17:20 🔗 Nemo_bis "I'm friend of"
17:20 🔗 Nemo_bis but perhaps it's a status update? let me check
17:21 🔗 Nemo_bis looks like a simple list, you mean http://www.splinder.com/profile/zoestyle/friendof ?
17:24 🔗 alard Yes.
17:28 🔗 alard What is missing? http://www.archiveteam.org/index.php?title=Splinder#Example_URLs
17:30 🔗 Nemo_bis looking
17:31 🔗 alard Comments are missing. I'd like to find examples (of comments on a media item, for example, preferably so many that there is pagination).
17:32 🔗 alard Do you happen to have an account? Is there more information visible if you log in?
17:34 🔗 Nemo_bis yes, I was going to ask about comments
17:34 🔗 Nemo_bis no, I don't use splinder actually
17:35 🔗 Nemo_bis all comments seem to be available in the same format as above, http://www.splinder.com/myblog/comment/list/<postID>
17:37 🔗 Nemo_bis and for media it's e.g. http://www.splinder.com/media/comment/list/25744482
17:37 🔗 alard Great. "Spiacente, non puoi commentare questo post!" probably means 'sorry, you can't/can no longer comment on this post'?
17:37 🔗 Nemo_bis so it probably follows the same convention, with ?from=50 to see the next page etc.
17:37 🔗 Nemo_bis yes
17:38 🔗 Nemo_bis I've not found a way to increase the comments per page
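The comment URL convention described above could be sketched as follows. The page size of 50 is only inferred from the `?from=50` link to the second page, and the function name and parameters are hypothetical; the real site may paginate differently.

```python
def comment_page_urls(item_id, comment_count, section="myblog", page_size=50):
    """Build the list of comment-page URLs for one post or media item.

    section is "myblog" or "media", following the two URL forms in the log.
    page_size=50 is an assumption based on the "?from=50" next-page link.
    """
    base = f"http://www.splinder.com/{section}/comment/list/{item_id}"
    urls = [base]  # the first page carries no "from" parameter
    for offset in range(page_size, comment_count, page_size):
        urls.append(f"{base}?from={offset}")
    return urls
```

For example, `comment_page_urls(21254470, 120, section="media")` yields the base URL plus `?from=50` and `?from=100`.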
17:38 🔗 alard Do you happen to have found an example link with the comments pagination?
17:39 🔗 Nemo_bis not yet
17:39 🔗 alard Not even on the blog?
17:39 🔗 alard (Where does the ?from=50 come from? Just a guess?)
17:41 🔗 Nemo_bis no, clicking the next page link
17:41 🔗 Nemo_bis found one: http://www.splinder.com/media/comment/list/21254470
17:41 🔗 Nemo_bis (first google result here: http://ur1.ca/5qe9w )
17:42 🔗 alard Wonderful. Not just a media item with comments, but a large one too.
17:44 🔗 Nemo_bis I don't see a way to get the item url from the comments feed
17:45 🔗 Nemo_bis but you're probably going to do it the other way round, I suppose
17:45 🔗 alard No, I was just looking if I could find that. The comment system is the same, though, you can replace /media/ with /myblog/ and you still get the same comments.
17:45 🔗 Nemo_bis ah
17:46 🔗 alard Any chance of finding a blog post with lots of comments?
17:46 🔗 Nemo_bis this explains why they don't have two series of ids
17:46 🔗 Nemo_bis isn't http://www.splinder.com/myblog/comment/list/25742977 ok?
17:47 🔗 Nemo_bis http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
17:47 🔗 alard I'd like to have a blog link. That's useful.
17:47 🔗 Nemo_bis http://civati.splinder.com/post/25742977
17:47 🔗 Nemo_bis (this is probably one of the main blogs here, this person is quite famous)
17:48 🔗 alard That also tells us something about the url structure: with or without slugs at the end.
17:48 🔗 alard Even more interesting: it shows that not every blog has the comments on the page.
17:48 🔗 Nemo_bis yes :-/
17:49 🔗 alard Is there a way to get an example of a media item with comments. (Not the comment page, but the media page that links there.)
17:50 🔗 alard Oh, wait, never mind.
17:50 🔗 Nemo_bis googled the comments? :)
17:50 🔗 alard No, I just saw that the number of comments is listed on the media page. That's important information, since it saves a request to the comments page for most media items.
17:51 🔗 Nemo_bis it was www.splinder.com/mediablog/danspo/media/21254470 anyway
17:51 🔗 alard Although if you have an example, that's useful for testing.
17:51 🔗 alard Ah, thanks.
17:52 🔗 alard Now for someone with a lot of albums, to see the pagination there.
17:54 🔗 alard Although I'm not sure that's interesting to download, since the album info is already listed with the items.
17:54 🔗 alard What is 'condividi'?
17:56 🔗 Nemo_bis "share"
17:57 🔗 Nemo_bis I can't find any mediablog with lots of albums, still looking
17:59 🔗 alard Well, leave it. It's not that important.
18:00 🔗 alard What may be interesting is the video url.
18:00 🔗 alard http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
18:00 🔗 alard _small suggests that there is something larger.
18:03 🔗 alard Ah, it seems it depends on the video.
18:03 🔗 alard http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
18:03 🔗 alard Doesn't have a small.
18:04 🔗 Nemo_bis ah, found that the number of albums is limited for standard accounts, unlimited for pro
18:04 🔗 Nemo_bis they downscale bigger videos?
18:06 🔗 alard Not sure, haven't found a way to get anything other than _small.
18:06 🔗 alard Unless there is no _small, but then other urls are different too. It depends on the video.
18:07 🔗 alard Older videos are different.
18:08 🔗 alard Ah, that is awkward. It's not just videos, also the images (newer ones have a predictable structure: id_square.jpg, id_medium.jpg etc.) Older ones have different ids for small, large etc.
18:09 🔗 * Nemo_bis facepalms
18:11 🔗 alard Is there any way to get a larger profile picture?
18:12 🔗 Nemo_bis looking
18:17 🔗 alard Probably not.
18:17 🔗 alard I think the list is more or less complete: http://www.archiveteam.org/index.php?title=Splinder#Site_structure
18:17 🔗 Nemo_bis can't find any in help pages, blogs etc.
18:20 🔗 alard Ah, no, the audio.
18:26 🔗 alard That's interesting, audio thumbnails: http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3 http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3
18:29 🔗 Nemo_bis I've asked comments to splinder people...
18:30 🔗 alard Not the people *from* splinder, I hope?
18:30 🔗 alard :)
18:31 🔗 Nemo_bis no :)
18:32 🔗 Nemo_bis so the thumbnail is the same audio at 32 kb/s
18:34 🔗 alard Yes, I think that's the difference. The duration is the same.
18:34 🔗 alard Not all audio files have a thumbnail, by the way, older ones do not.
18:35 🔗 alard What's the point of http://bloggando.splinder.com/ ? Is that just a normal blog?
18:36 🔗 alard I guess it is, there's even a profile named 'bloggando', maybe something special by the company.
18:36 🔗 Nemo_bis it's a manual selection of posts by them
18:37 🔗 Nemo_bis they've also published some books
18:44 🔗 alard This comment http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358
18:44 🔗 alard 'il settore blog', would that be just the blogs, or the complete user content on the site?
18:49 🔗 alard Later more.
18:53 🔗 Nemo_bis it means the "blog division" of the company; splinder is a subset of it
20:00 🔗 bsmith094 im trying to wget a webcomic site, how do i get around 403 forbidden? tried ignoring robots, didn't work, any suggestions?
20:00 🔗 Nemo_bis change user agent?
20:10 🔗 bsmith094 ok thanks, i just finally figured out the proper syntax for that, and apparently the site is only blocking googlebot from some python scripts
20:11 🔗 bsmith094 not the files so, woo, that worked! wget -U rocks
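The same user-agent trick can be done from Python's standard library. This is a minimal sketch: the helper names are made up, and whether a given site stops returning 403 depends entirely on how that site filters requests.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request that will send the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent):
    """Fetch the URL with a custom User-Agent and return the body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

This roughly corresponds to `wget -U "some agent string" URL`; only the header changes, the request is otherwise an ordinary GET.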
23:24 🔗 alard Who can help with an experiment?
23:25 🔗 alard Experiment is as follows: please git pull from https://github.com/ArchiveTeam/splinder-grab, get wget-warc, then see if you can download a profile from www.splinder.com
23:37 🔗 underscor Example profile name?
23:42 🔗 PepsiMax hey guys, 11/11/11
23:45 🔗 alard underscor: lowvoice
