Time |
Nickname |
Message |
00:41
🔗
|
underscor |
http://archiveofourown.org/works/258626 |
00:57
🔗
|
chronomex |
yes, yes. |
00:58
🔗
|
underscor |
chronomex: No way, you're beating me! |
00:58
🔗
|
underscor |
:'( |
02:08
🔗
|
underscor |
Coderjoe flatlined |
02:24
🔗
|
chronomex |
underscor: how the hell do I beat you |
02:31
🔗
|
underscor |
chronomex: download faster |
04:03
🔗
|
yipdw |
hm, that's annoying: you can plug in anything for the username in http://developer.berlios.de/devlog/username, and you'll get back 200 OK |
04:33
🔗
|
underscor |
uh oh, this wget-warc is using all my memory :( |
04:47
🔗
|
yipdw |
I get the feeling that BerliOS developer logs are pretty sparse |
04:47
🔗
|
yipdw |
I've checked 1,058 users so far and found 2 real devlogs |
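Because the devlog URL answers 200 OK for any username, the status code alone cannot separate real devlogs from placeholders. One possible workaround, sketched here in Python, is to compare each page against the page returned for a username that is known not to exist. The function names are hypothetical and nothing here is taken from yipdw's actual checker; it also assumes the placeholder page differs between users only by the username itself.

```python
import hashlib


def fingerprint(body: str, username: str) -> str:
    """Hash a page body after removing the requested username from it."""
    normalized = body.replace(username, "").strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def looks_like_real_devlog(body: str, username: str,
                           bogus_body: str, bogus_username: str) -> bool:
    """True when the page differs from the one served for a made-up user.

    bogus_body is the page fetched for bogus_username, a name assumed
    not to exist on the site.
    """
    return fingerprint(body, username) != fingerprint(bogus_body, bogus_username)
```

A page that only echoes the username back hashes the same as the bogus baseline and is skipped; anything with extra content is kept for a closer look. Very short usernames could also appear elsewhere in the markup, so this is a first-pass filter rather than a guarantee.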
04:54
🔗
|
yipdw |
curses, Paradoks is beating me on mobileme by a gigabyte! |
04:54
🔗
|
yipdw |
BUT NOT FOR LONG |
04:57
🔗
|
chronomex |
yeah. right. |
05:04
🔗
|
rude___ |
buh |
05:48
🔗
|
Coderjoe |
underscor: yeah... running out of disk space will do that |
05:57
🔗
|
Coderjoe |
I'm probably going to pull off of this node once I finish syncing stuff up |
05:57
🔗
|
delta_sav |
Hi |
05:59
🔗
|
delta_sav |
Is anyone running archives on 4chan other than ones submitted by channers? |
06:00
🔗
|
Coderjoe |
I was doing manual saves of threads on my own. I was considering writing something to automatically queue threads to be downloaded as well, but never got to it. and then I stopped visiting 4chan for the most part |
06:01
🔗
|
Coderjoe |
(and the auto-archiving scares me a bit due to CP posts) |
06:01
🔗
|
delta_sav |
shizer you're right |
06:02
🔗
|
Coderjoe |
captain picard can be rather troublesome |
06:02
🔗
|
Coderjoe |
I would rather not be vanned |
06:02
🔗
|
delta_sav |
could grab the images briefly for md5sum then store that |
06:03
🔗
|
delta_sav |
I could write something to get everything |
06:03
🔗
|
delta_sav |
actually, I could do all three |
06:03
🔗
|
delta_sav |
I should |
06:04
🔗
|
Coderjoe |
I've already got something to download everything in a thread. it might need some tweaks, however. |
06:04
🔗
|
delta_sav |
you in the habit of sharing? |
06:04
🔗
|
Coderjoe |
I just needed to write another script that ran through the index pages and queue up new threads |
06:05
🔗
|
delta_sav |
it updates pretty quick but I could imagine something that wouldn't miss a thread |
06:05
🔗
|
delta_sav |
I think 4chan needs to be archived |
06:06
🔗
|
Coderjoe |
http://wegetsignal.org/raper.sh |
06:06
🔗
|
delta_sav |
couldnt archive the images as well though or the storage would be too much |
06:06
🔗
|
delta_sav |
best domain name ever |
06:06
🔗
|
Coderjoe |
it has a few things hard coded, like it likes to reside in ~4chan, looks at a textfile named raper.threads for urls to the thread pages, etc |
06:07
🔗
|
Coderjoe |
this downloads the thread page and the images and thumbnails |
06:07
🔗
|
delta_sav |
i'm looking at it |
06:07
🔗
|
Coderjoe |
I don't know if it ever worked on the flash board |
06:08
🔗
|
Coderjoe |
but it would also not delete images that got deleted on the server |
06:08
🔗
|
Coderjoe |
or even re-download them |
06:08
🔗
|
delta_sav |
curl not wget? |
06:09
🔗
|
Coderjoe |
I like wget |
06:09
🔗
|
Coderjoe |
and I make use of the -i parameter a lot |
06:10
🔗
|
delta_sav |
lol if this is an introduction I should say I've been doing data scraping for the last 6 months but just quit my job as a corporate whore :p |
06:10
🔗
|
Coderjoe |
the UA string at the top should give you a bit of an idea how long ago I wrote the script |
06:10
🔗
|
delta_sav |
my ua these days is "internet ready toaster oven" |
06:10
🔗
|
Coderjoe |
hell.. it even mentions the 4chan server named "img" which doesn't even exist anymore |
06:11
🔗
|
Coderjoe |
along with a workaround for img not returning 404 when a thread died |
06:11
🔗
|
Coderjoe |
er |
06:11
🔗
|
Coderjoe |
no img was, the others were not |
06:12
🔗
|
delta_sav |
not used to while(<>) in bash |
06:12
🔗
|
delta_sav |
errthing i take it? |
06:13
🔗
|
bbot_ |
delta_sav: http://chanarchive.org/ |
06:13
🔗
|
bbot_ |
also http://archive.no-ip.org/ |
06:13
🔗
|
Coderjoe |
i pretty much just ran this in a "while /bin/true; do blah; sleep; done" loop |
06:13
🔗
|
delta_sav |
no FAQ and more gives internal server error |
06:13
🔗
|
bbot_ |
though neither of them redistribute archives, which is a shame |
06:13
🔗
|
delta_sav |
:{ |
06:14
🔗
|
delta_sav |
4chan just may be the easiest way for anyone to say anything, which means it's prolly the most important thing to archive IMO |
06:14
🔗
|
bbot_ |
maybe |
06:15
🔗
|
Coderjoe |
it might be better to rewrite in python or something, with a database for the thread queue |
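A rough idea of what "a database for the thread queue" could look like in Python, using the standard-library sqlite3 module. The table layout, state names, and function names are all assumptions made for illustration; Coderjoe's script keeps its queue in a text file and no such schema appears in the log.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) a thread queue; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        " url TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'queued')"
    )
    return conn


def enqueue(conn, url):
    """Add a thread URL; returns False if it was already known."""
    cur = conn.execute("INSERT OR IGNORE INTO threads (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def next_queued(conn):
    """Return the oldest thread URL still marked 'queued', or None."""
    row = conn.execute(
        "SELECT url FROM threads WHERE state = 'queued' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_dead(conn, url):
    """Record that a thread has expired so it is not fetched again."""
    conn.execute("UPDATE threads SET state = 'dead' WHERE url = ?", (url,))
    conn.commit()
```

The primary key on the URL means an index-page scanner can call enqueue for every thread it sees without creating duplicates.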
06:16
🔗
|
delta_sav |
chanarchive.org looks solid but who are they? |
06:16
🔗
|
delta_sav |
how do you join/help |
06:17
🔗
|
delta_sav |
4chan is busy but not THAT busy, bash will do |
06:17
🔗
|
delta_sav |
bash -> mysql |
06:17
🔗
|
delta_sav |
to lamp for frontend for rest of internet personnel |
06:19
🔗
|
Coderjoe |
my bash script is already a big hack. adding a database does not seem like a good thing to do. |
06:19
🔗
|
Coderjoe |
(in bash) |
06:20
🔗
|
delta_sav |
eventually it'll get pretty big, I'm not sure thats even an "eventually" |
06:20
🔗
|
Coderjoe |
the python wasn't about speed, but stability and readability. I could add an HTMLParser that properly handled the img and a tags, for example. it would be a lot cleaner and less fragile than the perl blob in the middle of that bash file |
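For reference, a minimal sketch of the kind of parser described here, built on the standard-library html.parser module. The class and function names are made up for this example, and it only collects href and src values; it does not know anything about 4chan's actual markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def collect(html):
    """Return (links, images) found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images
```

Unlike a line-by-line regex, this copes with attributes in any order, mixed-case tags, and tags that span several lines.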
06:21
🔗
|
Coderjoe |
er |
06:21
🔗
|
Coderjoe |
s/readability/reliability/ |
06:21
🔗
|
Coderjoe |
stupid brain |
06:21
🔗
|
delta_sav |
heheheheheheh, not readable for me I'm from the land of C |
06:22
🔗
|
Coderjoe |
if you're a decent programmer, it shouldn't be difficult to read stuff written in most languages |
06:22
🔗
|
delta_sav |
read no |
06:22
🔗
|
delta_sav |
write, it gets a lil tricky |
06:23
🔗
|
Coderjoe |
python makes it so much easier to whip up quick scripts to do complex things. |
06:23
🔗
|
Coderjoe |
you don't have to make them all OOP and everything if you don't want to |
06:23
🔗
|
delta_sav |
no |
06:23
🔗
|
delta_sav |
fuck OOP |
06:24
🔗
|
Coderjoe |
get out. :P |
06:24
🔗
|
delta_sav |
you've made some tasty bash |
06:26
🔗
|
delta_sav |
what's the best gide IUO |
06:26
🔗
|
delta_sav |
**guide |
06:26
🔗
|
Coderjoe |
guide to..? |
06:26
🔗
|
delta_sav |
advanced bash |
06:27
🔗
|
Coderjoe |
i dunno. I just figured it all out on my own with manpages and stuff |
06:27
🔗
|
delta_sav |
I see a whole shit-ton of caveats i didnt know so I'm curious |
06:28
🔗
|
delta_sav |
I do most of my quick-dev grunt work in bash... for said record |
06:28
🔗
|
Coderjoe |
i've been doing bash stuff for 17 years or so, though the most advanced bash stuff (arrays and stuff) i only started doing in the past 7 or so |
06:29
🔗
|
Coderjoe |
for me, it depends on what I need to do. |
06:29
🔗
|
Coderjoe |
I've done quick grunt stuff in bash, perl, python, and php |
06:29
🔗
|
delta_sav |
what do you use as a syntax ref? |
06:29
🔗
|
Coderjoe |
man pages and trial and error? |
06:30
🔗
|
delta_sav |
im bash/perl mostly, C for the fun stuff |
06:30
🔗
|
Coderjoe |
and also a few in C (my day job is mostly C++) |
06:30
🔗
|
Coderjoe |
C particularly if I don't need to do much string manipulation or things like database or the like |
06:31
🔗
|
delta_sav |
" if(/" I've never seen, what is? |
06:32
🔗
|
Coderjoe |
in the perl code? that's a regex match (the /string/ part is) |
06:32
🔗
|
delta_sav |
'//i' built in regex? |
06:32
🔗
|
Coderjoe |
that's perl code |
06:32
🔗
|
delta_sav |
nah its in bash |
06:32
🔗
|
Coderjoe |
no it isn't |
06:32
🔗
|
delta_sav |
if(/<a[^>]+href="([^"]+src[^"]+.jpg)"/i) |
06:33
🔗
|
Coderjoe |
look at the lines above that... IMAGE=`cat file | perl -e ' |
06:33
🔗
|
Coderjoe |
it is a multiline bash script being passed as -e |
06:35
🔗
|
Coderjoe |
er, multiline PERL script |
06:35
🔗
|
delta_sav |
guess i dont get what while (<>) is |
06:35
🔗
|
Coderjoe |
again, perl |
06:36
🔗
|
Coderjoe |
loops through reading from standard input until end of file |
06:36
🔗
|
delta_sav |
thought thats _$ |
06:36
🔗
|
Coderjoe |
into the variable $_ |
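For readers more at home outside Perl, the same idea can be written in Python: read input line by line (what while (<>) does, with each line landing in $_) and test each line against the regex quoted above. This is only an illustrative translation with a hypothetical function name, not the contents of the script; the pattern is copied as quoted, including its unescaped dot before jpg.

```python
import re

# Same pattern as the Perl if(/.../i) quoted in the log; re.IGNORECASE
# plays the role of the trailing /i flag.
IMAGE_LINK = re.compile(r'<a[^>]+href="([^"]+src[^"]+.jpg)"', re.IGNORECASE)


def image_links(lines):
    """Yield the first matching image URL on each line, like the Perl loop.

    Pass sys.stdin to mimic while (<>) reading standard input.
    """
    for line in lines:
        match = IMAGE_LINK.search(line)
        if match:
            yield match.group(1)
```

Calling image_links(sys.stdin) after import sys gives the closest equivalent of piping a saved thread page into the Perl one-liner.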
06:38
🔗
|
delta_sav |
? |
06:38
🔗
|
delta_sav |
erm, so in bash a while (<>) immediately after the def loops through |
06:38
🔗
|
delta_sav |
err, immediately before |
06:38
🔗
|
Coderjoe |
no, that while line is part of a PERL script |
06:39
🔗
|
delta_sav |
oh shit its a backtick and a ' |
06:39
🔗
|
delta_sav |
lol nm |
06:39
🔗
|
delta_sav |
I'm drunk, but do love archive team har |
06:39
🔗
|
delta_sav |
sorry |
06:44
🔗
|
delta_sav |
still dont get why no +~ tha |
06:44
🔗
|
delta_sav |
**tho |
06:44
🔗
|
delta_sav |
*****though |
06:45
🔗
|
delta_sav |
erm, =~ |
06:45
🔗
|
delta_sav |
im sorry nm excuse me |
06:47
🔗
|
Coderjoe |
another reason for rewriting it in python... it gets away from switching languages in the middle a few times. |
08:34
🔗
|
Coderjoe |
damn. 230gb behind already |
15:38
🔗
|
Schbirid |
if anyone wants to leech those emuwiki torrent files from me tell me now, i will delete the directory tomorrow |
16:27
🔗
|
Nemo_bis |
splinder.com closing, do you know? |
16:28
🔗
|
Nemo_bis |
(they have about half a million blogs, I think, mostly or only in Italian) |
16:34
🔗
|
alard |
When? |
16:35
🔗
|
Nemo_bis |
24 November, apparently |
16:35
🔗
|
Nemo_bis |
it's something like 50 million pages, they say |
16:35
🔗
|
Nemo_bis |
I'm trying to understand where the date comes from |
16:36
🔗
|
Nemo_bis |
there's no official announcement yet AFAIK |
16:39
🔗
|
Nemo_bis |
delete spam -> http://archiveteam.org/index.php?title=Information |
16:44
🔗
|
Nemo_bis |
ah, found the source for the date |
16:44
🔗
|
alard |
Is there something on the wiki about splinder.com? |
16:55
🔗
|
Nemo_bis |
I've just created the page http://archiveteam.org/index.php?title=Splinder |
16:58
🔗
|
alard |
Good. I'm trying to download the list of users. |
16:59
🔗
|
Nemo_bis |
ok |
17:00
🔗
|
alard |
Then, if we're going to do this, we probably need to make a list of what users have. |
17:00
🔗
|
Nemo_bis |
do you need any help with the language? |
17:02
🔗
|
alard |
Well, the language I can manage, I can more or less decipher what it says. (And there's always the us version, right?) |
17:02
🔗
|
alard |
But making a list of things they have would be useful. |
17:02
🔗
|
alard |
Where do the 'ultimi commenti' come from? |
17:14
🔗
|
alard |
Nemo_bis: Are you editing the wiki at the moment? If not, I'll have a go. |
17:14
🔗
|
Nemo_bis |
alard, no, I'm not editing |
17:15
🔗
|
alard |
Okay. |
17:15
🔗
|
Nemo_bis |
hm, checking "ultimi commenti" (last comments) |
17:15
🔗
|
alard |
It's probably sourced from the blog and other places, I guess, not a separate source of data. |
17:16
🔗
|
Nemo_bis |
they're comments from all blogs |
17:16
🔗
|
Nemo_bis |
they're shown at the bottom of each blog post |
17:16
🔗
|
Nemo_bis |
but also separately as in http://www.splinder.com/myblog/comment/list/25742977 |
17:18
🔗
|
alard |
Ah, ok. |
17:19
🔗
|
alard |
"I miei amici" => my friends, "Sono amico/a di" => friended by? |
17:20
🔗
|
Nemo_bis |
"I'm friend of" |
17:20
🔗
|
Nemo_bis |
but perhaps it's a status update? let me check |
17:21
🔗
|
Nemo_bis |
looks like a simple list, you mean http://www.splinder.com/profile/zoestyle/friendof ? |
17:24
🔗
|
alard |
Yes. |
17:28
🔗
|
alard |
What is missing? http://www.archiveteam.org/index.php?title=Splinder#Example_URLs |
17:30
🔗
|
Nemo_bis |
looking |
17:31
🔗
|
alard |
Comments are missing. I'd like to find examples (of comments on a media item, for example, preferably so many that there is pagination). |
17:32
🔗
|
alard |
Do you happen to have an account? Is there more information visible if you log in? |
17:34
🔗
|
Nemo_bis |
yes, I was going to ask about comments |
17:34
🔗
|
Nemo_bis |
no, I don't use splinder actually |
17:35
🔗
|
Nemo_bis |
all comments seem to be available in the same format as above, http://www.splinder.com/myblog/comment/list/<postID> |
17:37
🔗
|
Nemo_bis |
and for media it's e.g. http://www.splinder.com/media/comment/list/25744482 |
17:37
🔗
|
alard |
Great. "Spiacente, non puoi commentare questo post!" probably means 'sorry, you can't/can no longer comment on this post'? |
17:37
🔗
|
Nemo_bis |
so it probably follows the same convention, with ?from=50 to see the next page etc. |
17:37
🔗
|
Nemo_bis |
yes |
17:38
🔗
|
Nemo_bis |
I've not found a way to increase the comments per page |
17:38
🔗
|
alard |
Do you happen to have found an example link with the comments pagination? |
17:39
🔗
|
Nemo_bis |
not yet |
17:39
🔗
|
alard |
Not even on the blog? |
17:39
🔗
|
alard |
(Where does the ?from=50 come from? Just a guess?) |
17:41
🔗
|
Nemo_bis |
no, clicking the next page link |
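The comment URLs discussed here lend themselves to a small generator. The sketch below builds the list of pages for one item from the URL pattern and the ?from= parameter mentioned above. The function name is hypothetical, and the page size of 50 is an assumption inferred from from=50 being the second page; it has not been confirmed against the site.

```python
def comment_page_urls(item_id, comment_count, kind="myblog", page_size=50):
    """List the comment pages to fetch for one blog post or media item.

    kind is "myblog" or "media", matching the two URL forms in the log.
    page_size=50 is a guess based on ?from=50 being the next page.
    """
    base = f"http://www.splinder.com/{kind}/comment/list/{item_id}"
    urls = [base]
    # Further pages start at offsets page_size, 2 * page_size, ...
    for offset in range(page_size, comment_count, page_size):
        urls.append(f"{base}?from={offset}")
    return urls
```

For example, comment_page_urls(21254470, 120, kind="media") yields the base URL plus the ?from=50 and ?from=100 pages.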
17:41
🔗
|
Nemo_bis |
found one: http://www.splinder.com/media/comment/list/21254470 |
17:41
🔗
|
Nemo_bis |
(first google result here: http://ur1.ca/5qe9w ) |
17:42
🔗
|
alard |
Wonderful. Not just a media item with comments, but a large one too. |
17:44
🔗
|
Nemo_bis |
I don't see a way to get the item url from the comments feed |
17:45
🔗
|
Nemo_bis |
but you're probably going to do it the other way round, I suppose |
17:45
🔗
|
alard |
No, I was just looking if I could find that. The comment system is the same, though, you can replace /media/ with /myblog/ and you still get the same comments. |
17:45
🔗
|
Nemo_bis |
ah |
17:46
🔗
|
alard |
Any chance of finding a blog post with lots of comments? |
17:46
🔗
|
Nemo_bis |
this explains why they don't have two series of ids |
17:46
🔗
|
Nemo_bis |
isn't http://www.splinder.com/myblog/comment/list/25742977 ok? |
17:47
🔗
|
Nemo_bis |
http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/ |
17:47
🔗
|
alard |
I'd like to have a blog link. That's useful. |
17:47
🔗
|
Nemo_bis |
http://civati.splinder.com/post/25742977 |
17:47
🔗
|
Nemo_bis |
(this is probably one of the main blogs here, this person is quite famous) |
17:48
🔗
|
alard |
That also tells us something about the url structure: with or without slugs at the end. |
17:48
🔗
|
alard |
Even more interesting: it shows that not every blog has the comments on the page. |
17:48
🔗
|
Nemo_bis |
yes :-/ |
17:49
🔗
|
alard |
Is there a way to get an example of a media item with comments. (Not the comment page, but the media page that links there.) |
17:50
🔗
|
alard |
Oh, wait, never mind. |
17:50
🔗
|
Nemo_bis |
googled the comments? :) |
17:50
🔗
|
alard |
No, I just saw that the number of comments is listed on the media page. That's important information, since it saves a request to the comments page for most media items. |
17:51
🔗
|
Nemo_bis |
it was www.splinder.com/mediablog/danspo/media/21254470 anyway |
17:51
🔗
|
alard |
Although if you have an example, that's useful for testing. |
17:51
🔗
|
alard |
Ah, thanks. |
17:52
🔗
|
alard |
Now for someone with a lot of albums, to see the pagination there. |
17:54
🔗
|
alard |
Although I'm not sure that's interesting to download, since the album info is already listed with the items. |
17:54
🔗
|
alard |
What is 'condividi'? |
17:56
🔗
|
Nemo_bis |
"share" |
17:57
🔗
|
Nemo_bis |
I can't find any mediablog with lots of albums, still looking |
17:59
🔗
|
alard |
Well, leave it. It's not that important. |
18:00
🔗
|
alard |
What may be interesting is the video url. |
18:00
🔗
|
alard |
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv |
18:00
🔗
|
alard |
_small suggests that there is something larger. |
18:03
🔗
|
alard |
Ah, it seems it depends on the video. |
18:03
🔗
|
alard |
http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv |
18:03
🔗
|
alard |
Doesn't have a small. |
18:04
🔗
|
Nemo_bis |
ah, found that the number of albums is limited for standard accounts, unlimited for pro |
18:04
🔗
|
Nemo_bis |
they downscale bigger videos? |
18:06
🔗
|
alard |
Not sure, haven't found a way to get anything other than _small. |
18:06
🔗
|
alard |
Unless there is no _small, but then other urls are different too. It depends on the video. |
18:07
🔗
|
alard |
Older videos are different. |
18:08
🔗
|
alard |
Ah, that is awkward. It's not just videos, also the images (newer ones have a predictable structure: id_square.jpg, id_medium.jpg etc.) Older ones have different ids for small, large etc. |
18:09
🔗
|
* |
Nemo_bis facepalms |
18:11
🔗
|
alard |
Is there any way to get a larger profile picture? |
18:12
🔗
|
Nemo_bis |
looking |
18:17
🔗
|
alard |
Probably not. |
18:17
🔗
|
alard |
I think the list is more or less complete: http://www.archiveteam.org/index.php?title=Splinder#Site_structure |
18:17
🔗
|
Nemo_bis |
can't find any in help pages, blogs etc. |
18:20
🔗
|
alard |
Ah, no, the audio. |
18:26
🔗
|
alard |
That's interesting, audio thumbnails: http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3 http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3 |
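The suffix patterns seen so far (_square and _medium for images, _small for video, _thumbnail for audio) could be turned into a list of candidate URLs to probe for each file id. This is a sketch with hypothetical names; it only applies to newer items with the predictable structure, the suffix table contains just the variants named in this log, and whether an unsuffixed image URL exists is an assumption.

```python
# Suffixes mentioned in the discussion; the real site may have more.
KNOWN_SUFFIXES = {
    "jpg": ("square", "medium"),
    "flv": ("small",),
    "mp3": ("thumbnail",),
}


def candidate_urls(file_id, ext):
    """Return URLs worth trying for one file id on files.splinder.com."""
    base = "http://files.splinder.com/" + file_id
    urls = [f"{base}.{ext}"]
    urls += [f"{base}_{suffix}.{ext}" for suffix in KNOWN_SUFFIXES.get(ext, ())]
    return urls
```

Older items use unrelated ids for each size, so their URLs would still have to be scraped from the pages that link to them.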
18:29
🔗
|
Nemo_bis |
I've asked splinder people for comments... |
18:30
🔗
|
alard |
Not the people *from* splinder, I hope? |
18:30
🔗
|
alard |
:) |
18:31
🔗
|
Nemo_bis |
no :) |
18:32
🔗
|
Nemo_bis |
so the thumbnail is the same audio at 32 kb/s |
18:34
🔗
|
alard |
Yes, I think that's the difference. The duration is the same. |
18:34
🔗
|
alard |
Not all audio files have a thumbnail, by the way, older ones do not. |
18:35
🔗
|
alard |
What's the point of http://bloggando.splinder.com/ ? Is that just a normal blog? |
18:36
🔗
|
alard |
I guess it is, there's even a profile named 'bloggando', maybe something special by the company. |
18:36
🔗
|
Nemo_bis |
it's a manual selection of posts by them |
18:37
🔗
|
Nemo_bis |
they've also published some books |
18:44
🔗
|
alard |
This comment http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/comment/65653358#cid-65653358 |
18:44
🔗
|
alard |
'il settore blog', would that be just the blogs, or the complete user content on the site? |
18:49
🔗
|
alard |
Later more. |
18:53
🔗
|
Nemo_bis |
it means the "blog division" of the company; splinder is a subset of it |
20:00
🔗
|
bsmith094 |
im trying to wget a webcomic site, how do i get around 403 forbidden? tried ignoring robots, didn't work, any suggestions? |
20:00
🔗
|
Nemo_bis |
change user agent? |
20:10
🔗
|
bsmith094 |
ok thanks, i just finally figured out the proper syntax for that, and apparently the site is only blocking googlebot from some python scripts |
20:11
🔗
|
bsmith094 |
not the files so, woo, that worked! wget -U rocks |
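The same trick works outside wget: most HTTP clients let you set the User-Agent header. A minimal Python sketch using the standard-library urllib.request follows; the function name and the agent string in the usage note are placeholders, not anything bsmith094 actually used.

```python
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to urllib.request.urlopen(build_request(url, "Mozilla/5.0 (compatible; example)")) performs the fetch with that agent string instead of the default Python one.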
23:24
🔗
|
alard |
Who can help with an experiment? |
23:25
🔗
|
alard |
Experiment is as follows: please git pull from https://github.com/ArchiveTeam/splinder-grab, get wget-warc, then see if you can download a profile from www.splinder.com |
23:37
🔗
|
underscor |
Example profile name? |
23:42
🔗
|
PepsiMax |
hey guys, 11/11/11 |
23:45
🔗
|
alard |
underscor: lowvoice |