[00:00] once again I find myself signing up for a dying platform. gotta love it
[00:00] you can't sign up
[00:00] the signup page is broken
[00:00] lol
[00:01] however... http://www.bugmenot.com/view/posterous.com
[00:01] I just reset the password of the first one to what's listed on bugmenot
[00:02] hmm
[00:03] there's an "id" key
[00:03] I wonder if that's serial
[00:03] and an api call to retrieve by id
[00:03] yeah it seems serial
[00:03] GET sites/:id
[00:03] closure: you paying attention?
[00:04] weird, the highest id that works is 299999
[00:04] heh
[00:04] hmm no
[00:04] 300001 works too
[00:04] missing users
[00:04] so you have a list?
[00:05] no, I'm using the api at https://posterous.com/api
[00:05] after logging in with bugmenot credentials
[00:06] estimated about 10 million possibly valid ids
[00:08] so you should be able to whip up a program to make a list
[00:08] right?
[00:10] I'd think so. It's not clear to me how their API uses the api token
[00:10] see https://posterous.com/api/docs/pages/overview
[00:11] 10k per day rate limit
[00:11] :/
[00:12] only 100 days!
[00:12] would need quite a lot of accounts, at least 15 or so working for months
[00:12] and we can always make more keys
[00:12] how?
[00:12] lots of accounts?
[00:12] account creation is disabled
[00:12] ahh
[00:13] emailing them a post still seems to open an account
[00:13] wheee
[00:13] really?
[00:13] nice
[00:13] http://joey-toqiv.posterous.com/post
[00:13] oh really?
[00:13] what email address?
[00:14] post@posterous.com ?
[00:14] Remember that they have a shortener called Post.ly
[00:14] yes
[00:15] use the URL shortener project to get a list of URLs, and therefore users
[00:15] good thinking
[00:15] who has the giant url archive? swebb?
[00:15] URLTeam does not have any Post.ly links as far as I know
[00:15] k
[00:15] weird
[00:15] Yes, swebb's got links from the tweethose
[00:16] 2 birds
[00:16] so if there's any post.ly links, he has a few.
[00:16] bird channel
[00:16] Two birds, one cup.
[00:16] ogod
[00:17] If we start getting post.ly URLs now (How hard is it to add a new domain?) then after a few days we should have a decent list of usernames to start working with
[00:19] curl -X GET --user bugmenot@trash-mail.com:bugmenot23 -d "api_token=JAHeFmHJldwlsrlyHcutDEBBhvhDvFbt" http://posterous.com/api/2/sites/99999
[00:20] well, that works.. json parsing and looping 1 to 10000 left
[00:21] also they have < 10800000 sites
[00:23] creating an account by email seems to take me to https://posterous.com/register?flow=newcomment
[00:23] closure: based on?
[00:25] exactly 10793724, based on bisection
[00:26] be warned, there are some invalid/deleted ones
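
(Aside: the [00:25] figure came from bisecting the id space. A minimal sketch of what that probe could look like, assuming the bugmenot credentials and api_token from [00:19] (long dead now) still authenticated at the time, and treating an HTTP 200 as "id exists"; deleted ids leave gaps, so the result is only approximate.)

lo=1
hi=20000000                       # arbitrary upper bound for the search
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        --user bugmenot@trash-mail.com:bugmenot23 \
        -d "api_token=JAHeFmHJldwlsrlyHcutDEBBhvhDvFbt" \
        -X GET "http://posterous.com/api/2/sites/$mid")
    if [ "$code" = 200 ]; then
        lo=$mid                   # id exists: the ceiling is at least this high
    else
        hi=$mid                   # id missing: assume we overshot
    fi
done
echo "highest existing id is roughly $lo"
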
[00:26] closure: does opening an account by email actually /work/?
[00:26] hmm
[00:26] the iOS app is still alive
[00:26] * balrog_ grabs it
[00:31] hmm, posting by email seems to make a site, but you still have to sign up to claim it
[00:31] * closure tries email password recovery
[00:31] nope
[00:31] I'm going to try ios account creation
[00:34] YES
[00:34] you can sign up from the iphone app
[00:34] "Thanks for using Posterous Spaces! Please click the link below to confirm your email address."
[00:34] lol
[00:34] awesome
[00:35] * closure rushes out to buy an iphone. 100 iphones
[00:35] you only need one
[00:35] do they have an android app
[00:35] or an ipad
[00:35] or an ipod touch
[00:35] I want to say yes
[00:35] https://play.google.com/store/apps/details?id=com.Posterous&hl=en
[00:35] aaah
[00:35] that's not gonna help
[00:35] you have to request an API key
[00:36] and all requests appear to be reviewed? huh
[00:36] suck
[00:36] nooo.. really?
[00:36] I used the bugmenot account, went to the api page, there's a place to click to see the key
[00:36] it's possible someone created it with a different email
[00:36] then changed it to bugmenot after the api key was generated
[00:37] so you don't have view token on http://posterous.com/api ?
[00:40] when I click it, it says: "To gain access to the Posterous Spaces API, please submit a request via our API request form."
[00:40] with the form linked
[00:40] the form asks for your name, email, and phone number, and why you want to use the API.
[00:40] I'm sure you can write something interesting there ;)
[00:41] what's funny is that the send request buttons on the page all work
[00:43] oh, in that case.. load it up in firebug or chrome dev console, you can see it make the request there and get key
[00:43] it may be using the cookie there
[00:43] but it is making a GET request on the API
[00:43] yeah that's what it appears to be doing
[00:43] using cookie-auth
[00:44] possibly referer-checking and such as well
[00:44] look for api_token=
[00:44] winrar
[00:44] there isn't any
[00:44] I'm looking at the request in wireshark
[00:45] huh
[00:45] now I'm getting "error": "Unauthorized You are not authorized to view this site."
[00:45] oh wait
[00:45] nvm
[00:46] yeah it's using an XMLHttpRequest
[00:46] and I see no api key
[00:47] X-RateLimit-Remaining doesn't seem to be decrementing though
[00:48] there's a cookie
[00:49] ooh, that's interesting
[00:49] perhaps they forgot to rate limit this way?
[00:49] what's up
[00:50] it's possible
[00:52] curl -H "X-Requested-With: XMLHttpRequest" -H "X-Xhrsource: posterous" -X GET -H "Cookie: _sharebymail_session_id=e55e807375f457efa9a22e091c0685c7; email=bugmenot%40trash-mail.com; _plogin=Veritas; logged_in_before=true" http://posterous.com/api/2/sites/107
[00:52] try that
[00:52] erm, assuming this is not our only usable api :)
[00:52] http://hastebin.com/finekoveva
[00:53] closure: hm?
[00:53] ok, so they also limit by IP, because that worked for me
[00:54] so that didn't work?
[00:54] for you?
[00:54] see the hastebin
[00:54] oh, NM, I misread it. ok.
[00:54] well, until they notice us and ban.. let's see if the request count is going down now
[00:55] I suggest creating like 100 accounts
[00:56] hmm, it seems your curl logged me out ;)
[00:56] LOL
[00:56] well they're clearly vulnerable to the firesheep exploit
[00:57] ah, no, I misunderstood what curl -I does
[00:57] ohh
[00:57] well, the api rate limit is *not* going down
[00:57] hah!
[00:57] so.. time to put some load on the servers people
[00:57] balrog_: firesheep author is a friend of mine ;)
[00:57] * closure claims first million
[00:58] hahaha
[00:58] you would
[01:00] just ran the afore-pasted curl 3 times here, btw. seemed to work fine.
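
(Aside: a quick way to watch whether the cookie-authenticated XHR path actually decrements X-RateLimit-Remaining, which is roughly the test being run above. The session cookie is the bugmenot one pasted in the channel and has long since expired.)

COOKIE='_sharebymail_session_id=e55e807375f457efa9a22e091c0685c7; email=bugmenot%40trash-mail.com; _plogin=Veritas; logged_in_before=true'
for id in 107 108 109 110 111; do
    # -D - dumps the response headers to stdout, -o /dev/null discards the body
    curl -s -D - -o /dev/null \
        -H "X-Requested-With: XMLHttpRequest" \
        -H "X-Xhrsource: posterous" \
        -H "Cookie: $COOKIE" \
        "http://posterous.com/api/2/sites/$id" \
      | grep -i '^x-ratelimit-remaining'
done
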
[01:00] so, I'm running this: for s in $(seq 1 1000000); do curl -H "X-Requested-With: XMLHttpRequest" -H "X-Xhrsource: posterous" -X GET -H "Cookie: _sharebymail_session_id=e55e807375f457efa9a22e091c0685c7; email=bugmenot%40trash-mail.com; _plogin=Veritas; logged_in_before=true" http://posterous.com/api/2/sites/$s >| site.$s; done
[01:00] can pull the user urls out of the files later
[01:01] have 300 done, no sign of the rate limit header going down
[01:02] the apps probably use the API too and have their own platform API keys
[01:02] it's not exactly nice but we could rip them out :)
[01:03] seems you don't need a key, just a login cookie
[01:03] so log in with bugmenot, get the cookie
[01:05] haha
[01:08] balrog_: i suggest using HTTP HEAD to determine the existence of pages, instead of GET.
[01:11] I'm running 100 concurrent api grabbers now. Seems to be working.
[01:12] hah, they rename spam sites to $site-banned.posterous.com
[01:13] 20k users snarfed
[01:13] closure, should i start another range, or have you got this covered?
[01:13] I think I have 1-1 million
[01:14] just getting usernames, not downloading any site
[01:14] righto.
[01:15] which will be around a 4 gb download itself
[01:23] 10793855 is the highest number that seems to give me results, though if there are gaps i may be mucking things up a bit.
[01:23] there are gaps but that seems to be a fairly good ballpark estimate
[01:23] we need a new irc channel for this
[01:24] GENERIC MESSAGE ABOUT POSTEROUS
[01:24] EcapsCore, GENERIC "WE'RE ON IT" RESPONSE
[01:24] and thanks :)
[01:24] GENERIC "OKAY" RESPONSE
[01:24] or rather, closure / balrog_ / other_smart_people are on it.
[01:25] Think we'll be able to do it?
[01:25] I'm still mad about nbc's handling of everyblock
[01:25] it's like they deliberately did it to prevent archival
[01:26] http://www.wbez.org/blogs/britt-julious/2013-02/being-here-vs-living-here-why-everyblock-mattered-105550
[01:27] need about 10 other people to run this on a million each: http://pastebin.com/4ka5niDy
[01:28] I can run it, but we need to keep order
[01:28] wiki?
[01:28] btw, that session_id might expire in an hour or who knows
[01:30] Gah, horrible mobile IRC clients.
[01:32] * closure has 112 thousand sites already
[01:32] so, until they ban me, I guess it's going pretty well ;)
[01:32] ;)
[01:33] shall i start a table on the wiki?
[01:33] yes please
[01:33] once we have all names, how do we download sites?
[01:34] probably one of the new fancy warrior jobs
[01:36] so is there a page or something that details what people have done already?
[01:36] it's just downloading names right?
[01:36] right, names and a few other bits and bobs
[01:38] ugh. http://archiveteam.org/index.php?title=Posterous
[01:38] wikis.
[01:39] filling in all ranges now...
[01:39] I'll grab 7m
[01:39] if you don't like the wiki, just use some collaborative editing tool online
[01:40] I'll do 1-2m unless it's already done (example makes me think so)
[01:41] please change the script to make curl not dump anything to the screen
[01:41] ranges 1 - 11,000,000 on the wiki for claiming.
[01:42] I have 7m
[01:42] ok. will mark it for balrog_.
[01:42] I'll grab 9m
[01:42] got it NotGLaDOS
[01:43] can someone please make the script not spit out so much crap to console?
[01:43] add -s to the curl line
[01:45] curl -sH ?
[01:45] curl -s -H ...
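
(Aside: the http://pastebin.com/4ka5niDy script handed out above is gone. This is only a guess at the shape of a per-range grabber like it, using the same cookie-auth request as the [01:00] loop and the -s flag suggested at [01:43]; the cookie is long expired, and the file and argument names here are made up.)

#!/bin/sh
# grab-range: fetch one claimed range of site ids, one JSON response file per id.
# Usage: ./grab-range 1000000 1999999
start=$1
end=$2
COOKIE='_sharebymail_session_id=e55e807375f457efa9a22e091c0685c7; email=bugmenot%40trash-mail.com; _plogin=Veritas; logged_in_before=true'
for s in $(seq "$start" "$end"); do
    curl -s \
        -H "X-Requested-With: XMLHttpRequest" \
        -H "X-Xhrsource: posterous" \
        -H "Cookie: $COOKIE" \
        "http://posterous.com/api/2/sites/$s" >| "site.$s"
done

(Splitting a claimed million into background chunks, the way ./snarf gets invoked later, is just a matter of running several of these over sub-ranges in parallel.)
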
[01:49] hah
[01:49] it's not going very fast for me, 5589 so far
[01:49] and load averages: 95.54 79.33 41.68
[01:49] o_O
[01:49] Nah, you'll be alright aye
[01:49] here is a simple data processor to get a list of sites found: http://pastebin.com/YpLXBb1w
[01:50] hmm, my load is only 50 or so
[01:50] what speed are you getting?
[01:50] 1 million may take too long
[01:50] and/or the cookie may expire
[01:50] well, it happens to be round-tripping all the way to the UK, and I have 206000 done
[01:51] this box is IO-starved :/
[01:51] granted, this is a well-connected server
[01:51] my connection is fine
[01:51] my IO is shit
[01:51] I need something with eSATA
[01:51] it'd be possible to hook the curl output up to the perl script and then you just get a list of sites
[01:51] right now it's running off a USB disk
[01:56] how am I supposed to run it? for chunk in $(seq 100 199); do ./snarf $chunk &; done ? so I'm guessing the file should be called snarf.sh?
[01:56] yeah
[01:57] well, snarf
[01:57] Yay, dos line endings interfering with the script!
[01:58] so, it's not just me- I feel a lot better now
[01:58] yay appropriate irc nick
[01:59] Ugh, forgot how to change a file encoding in vim
[01:59] :%s/\r//g maybe
[02:00] apparently the tools are now called tofromdos, and you call fromdos or todos
[02:01] "Pattern not found: Y0LOSWAG"
[02:01] Wut
[02:08] ok, they're overloaded I think
[02:08] getting connect fails
[02:08] or I'm banned
[02:08] getting fails here too
[02:09] I'm getting connect hangs
[02:09] actually no
[02:09] it's still working here
[02:13] I have a better (less disk IO) script
[02:13] will it pick up if I kill this one?
[02:14] no
[02:14] well, I could make it
[02:14] one sec
[02:14] Going to wait for that script before
[02:15] I start. Last time I had a high disk IO (feeding 103 books to my bot), my host yelled at me
[02:16] hmm. doing about 2000 per minute here... load avg 2.
[02:16] scratch that, 4 :)
[02:17] http://pastebin.com/ZJ1WEi56
[02:17] load averages: 93.97 94.47 86.21
[02:18] named in honor of my recently meeting Kryton. nice guy ;)
[02:27] my ip is 100% banned. I have 4 other IPs here.. was there a way to make curl use a different one?
[02:28] can't you just ask them for a dump
[02:29] so you can't even access their website?
[02:30] nope
[02:30] wow
[02:31] I guess someone noticed the .24 million hits
[02:31] 33633 grabbed before stop
[02:32] smeg will replay the site.* files you got and resume
[02:32] running it.
[02:32] not as fast as I'd like because of I/O contention
[02:34] yeah, it'll be IO-y when resuming
[02:34] * closure guesses that aroud 99% of powerous sites are spam
[02:35] typoy tonigt
[02:35] lol why you think that?
[02:35] just looking at all the iphonefoo and dealershipsblah names
[02:35] how do I know when they ban my IP?
[02:36] maybe I should not let them do this?
[02:37] 432175 iphone-wodq.posterous.com
[02:37] 432177 iphone-sylf.posterous.com
[02:37] 432179 iphone-afii.posterous.com
[02:39] 522173 idxf0d1mnl-banned.posterous.com
[02:39] 522175 o6nuzdet0g-banned.posterous.com
[02:39] 522177 bgv23e4mls-banned.posterous.com
[02:39] 522179 ik1gtjuxyi-banned.posterous.com
[02:39] lol
[02:39] I have pages of that
[02:39] cat *.hostnames | wc -l gives me 33028
[02:43] btw what happened with xanga-grab?
[02:49] closure: http://www.sysadminvalley.com/2009/06/29/curl-requests-by-binding-to-different-ip-address/
[02:49] see man page
[02:53] 950831 best-price-for-nexium-generic-banned-banned.posterous.com
[02:53] double banned!
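
(Aside: the data processor from [01:49] is another dead pastebin, and the original was perl. Here is a rough shell stand-in: it assumes each site.N file holds the JSON for one site and that the site object carries a "full_hostname" field; that field name is a guess, so adjust it to whatever the responses actually contain.)

for f in site.*; do
    id=${f#site.}
    # pull the quoted value after "full_hostname" out of the JSON
    host=$(grep -o '"full_hostname" *: *"[^"]*"' "$f" | head -n1 | cut -d'"' -f4)
    [ -n "$host" ] && printf '%s %s\n' "$id" "$host"
done > found.hostnames

(The IP-binding trick behind the [02:49] link is curl's --interface option, which takes a local interface name or source address, for example: curl --interface 192.0.2.5 ... where the address is just a placeholder for one of the spare IPs.)
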
[02:54] :D
[02:54] someone really likes their banhammer
[02:55] cool, got around the ban
[02:55] yeah
[02:56] hmm, the numbers stopped advancing
[02:56] that's ban time, probably
[02:56] wait, they started again
[02:57] but that may not mean much
[02:57] or it's a little overloaded maybe?
[02:57] how do I check if it's valid?
[02:57] the hostname files will only grow if it's fail
[02:58] tail -f *.hostnames
[02:58] er, grow if it's valid
[02:59] we is banned
[03:00] cat *.hostnames | wc -l : 39462
[03:00] :\
[03:00] closure: add some throttling
[03:01] curl http://posterous.com
[03:01] curl: (7) couldn't connect to host
[03:02] well, you could run less than 100 at a time
[03:02] maybe they won't ban at 10 at a time or something
[03:07] so does it really start at 100100 or is that a mistake? if it's supposed to start at 100000, could someone grab those first ones?
[03:08] dashcloud: you did a million already?
[03:08] oh, I'm getting from 1, don't worry
[03:08] I ran this: for chunk in $(seq 100 199); do ./snarf $chunk &; done
[03:08] wait- I think I know what happened
[03:09] I think I botched part of the download for 1-2 million
[03:09] I'm curious how you evaded the IP ban
[03:09] probably because I didn't actually download a million
[03:10] looks like I had done 10k or so
[03:10] switch to smeg, it'll resume
[03:10] they banned me at about 40k
[03:11] I think I edited the snarf.sh and accidentally shrunk the range
[03:16] this posterous thing is a nice preview of how much twitter cares about preserving all their data, btw..
[03:20] they've got it preserved- you just can't get to it unless you've got $$$$$$ bucks, and an "appropriate" business plan
[03:20] you can't do analytics and data-mining stuff without an extensive archive
[03:21] well, I just found my 250000'th hostname
[03:23] * closure thinks it's hilarious that they have an api key with such an easily bypassed rate limit. wonder how common that is?
[03:26] fwiw, they seem to have around 25 boxes in the cluster handling these api calls, based on some headers
[03:26] closure: lmk if you figure a way around the ip-block
[03:26] well, if you have more IP addresses, I do
[03:27] I don't :/
[03:27] well I do but I don't want to waste them... the banned one was my primary
[03:29] there's always EC2
[03:29] looks like SketchCow got some computer power magazines uploaded: http://archive.org/details/computer_shopper
[03:38] I've made a better smeg that automatically resumes from the *.hostnames files when re-run. So you can move the files elsewhere or give them to someone else to continue.
[03:38] http://pastebin.com/VUvydX0q
[03:39] hmm, may be the ugliest for loop I've ever written in shell
[03:39] well someone wants these?
[03:39] ew
[03:39] I'll take them
[03:39] if you're done
[03:40] I have 39462 grabbed and I'm banned
[03:41] closure: see pm
[03:44] you sure got banned before many were done.. I'll bet it's not automatic, just they noticed you
[03:44] * closure reserves an ip for the 3 am run ;)
[03:47] here's an interesting project I found: https://github.com/calufa/tales-core
[03:47] block-tolerant scraper
[03:49] I'm going to remove my name from my blocks, because I won't be around this week to babysit them
[03:51] we can always go full-on-warrior job just to get the hostnames if it comes to it
[03:52] ... and take a month or whatever
[03:52] I'd worry they might fix their broken rate limit before we're done
[03:55] whups 2 more ips banned
[04:02] whack-a-mole
[04:02] try rate-limiting?
[04:02] spose I ought to
[04:02] maybe it's time for the week-long slow burn?
[04:03] i mean... we have six whole weeks... how "generous" of them.
[04:03] gotta get the data too
[04:03] if they rate limit just hostnames ......
[04:04] otoh, this is probably running on some barely scaled part of their architecture
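
(Aside: a sketch of the single-thread "slow burn" being debated here. It is not the smeg script from the dead pastebin; it just skips ids that already have a non-empty site.N file as a crude resume, and sleeps between requests so there is at least some throttling. Same long-expired cookie as above.)

start=${1:-1}
end=${2:-1000000}
COOKIE='_sharebymail_session_id=e55e807375f457efa9a22e091c0685c7; email=bugmenot%40trash-mail.com; _plogin=Veritas; logged_in_before=true'
for s in $(seq "$start" "$end"); do
    [ -s "site.$s" ] && continue    # already fetched: skip
    curl -s \
        -H "X-Requested-With: XMLHttpRequest" \
        -H "X-Xhrsource: posterous" \
        -H "Cookie: $COOKIE" \
        "http://posterous.com/api/2/sites/$s" > "site.$s"
    sleep 1                         # roughly 60 requests a minute; tune to taste
done
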
[04:08] I have over half a million hostnames if someone wants to start the site mirroring BTW
[04:28] hey. what's going on.
[04:29] closure, i'm going to keep at my range overnight... one measly thread on a new IP... just to see what happens.
[04:33] k.. I have an ip that's running only 10 threads, also to see
[04:33] hmm. 'bout 80 per minute with one thread. ten days to finish a block of 1 million.
[04:53] not bad
[04:57] 100 per minute now. 7 days.
[04:57] though i guess that's not counting misses at all, so it may be a bit faster.
[05:00] poop. banned with one thread after just 2000 or so requests.
[05:01] wow
[05:01] or i'm just hitting a big dry patch.
[05:01] yeah, me also banned
[05:02] so, admins are sitting at the console with a red bull in one hand and a banhammer in the other
[05:02] time to go away for 12 hours ;)
[05:03] * closure has 935845 hostnames including other grabs
[05:39] have we set up #perposterus yet?
[05:39] sorry, #preposterus
[05:46] beats #closurus (no)
[05:47] the constant ip banning is a bit of a problem
[05:47] although I am over 1 million, on my 5th ip
[05:49] i got 5000 with a single thread on my third IP. sadface.
[05:50] i guess maybe balls-out is the way to go... slurp it quick 'til you're banned.
[05:52] possibly. or it just keeps them opening the red bull
[05:52] schhk
[09:35] http://techcrunch.com/2013/02/15/posterous-will-shut-down-on-april-30th-co-founder-garry-tan-launches-posthaven-to-save-your-sites
[09:37] we are already mapping the site out
[09:38] but we keep getting banned
[09:39] has that ever stopped us before? :D
[09:41] not that I know of
[09:42] banning only builds the rage
[10:35] closure: the script seems to be scarce with output info
[10:35] sweet, just found a picture of my great great great grandmother, great great grandmother, great grandmother and my grandmother all in a line out on the farm
[11:00] :o
[11:34] Could someone write to Garry Tan to ask him what the story is and how big Posterous is?
[12:51] SketchCow: i have over 7k videos now in g4video-web collection
[13:45] SketchCow: we know Posterous has just over 10 million sites. Size of sites unknown
[20:49] closure: Thanks
[21:24] Oh boy, ANOTHER person going "it was free, what do you suspect"
[21:24] "What did you expect"
[21:34] There's a bunch of those morons.
[21:35] Don't go reading CircleHackerJerkerNews. ;-)
[21:38] Wait- you mean the Y combinator message board thinks this is fine?
[21:41] Yes, basically. But it might be related to the fact that it's mainly a bunch of startup inmasturbators there
[21:42] "Of COURSE it's fine to burn the house down when money runs out!"
[22:10] http://www.archiveteam.org/index.php?title=Main_Page
[22:12] :)
[22:12] and most hners think they could build something to take posterous' place
[22:13] OK, I have to go take a flight from Wellington to Auckland.
[22:14] plz try not to fall in the sea.
[22:14] On it
[22:15] This Posterous one is different. People are getting it.
[22:15] I mean, they got it with geocities and others.
[22:15] But people are really getting this one, what it represents.
[22:15] \o/
[22:17] OK, I'll be back on later.
[22:22] SketchCow: Neat. New logo/image.