[00:02] http://fpaste.org/354840/05052241/ <- contribution from Yoshimura, will be added to wiki later
[00:19] *** JesseW has joined #urlteam
[01:50] Pushed changes to wiki, S* T* shorteners classification and info, the changes as txt file: http://fpaste.org/354848/05120341/
[03:00] *** bwn has quit IRC (Ping timeout: 492 seconds)
[03:13] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[03:18] *** Yoshimura has joined #urlteam
[03:40] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[03:41] *** Yoshimura has joined #urlteam
[04:27] *** bwn_ has joined #urlteam
[04:33] *** bwn_ is now known as bwn
[05:18] nsfw-in project started (thanks Yoshimura for the investigation!)
[05:20] Yaaay! I wanted to cheer about more data to get, but the number of links there is not that great :D (50k?) Thanks for the info and the shout ^^
[05:21] you're welcome -- I'm still tweaking the regex to pick it out of the body, though
[05:21] (oops, I had it set to HEAD, not GET) -- that may have been the problem
[05:23] Hah. You should pick some easy-to-do, larger shortener, as there are not enough tasks to work on in the URL project :)
[05:23] of the ones you added, which one seemed largest (and straightforward)?
[05:24] Just by looking at the textfile, shorturl.com seems simplest. tiny.pl is also larger, but needs one condition
[05:25] it doesn't look like you added tiny.pl to the wiki yet?
[05:25] oh, now I see it
[05:26] ok, I'll do tiny.pl next, then shorturl.com
[05:26] Shorturl is larger though, and simpler. Well, larger by the possible combinations.
[05:27] eh, I'll get them both
[05:27] I did not express myself well before and caused confusion, apologies.
[05:27] :-)
[05:28] ah, results have come in for nsfw-in
[05:28] ok, boosting the nsfw-in queue to 20
[05:29] I'll increase it further in a while, if it continues to be OK
[05:29] * Yoshimura still has to understand the queue limits and stuff in the api
[05:30] The queue limit is the number of jobs that can be worked on at once.
[05:30] It's a way to avoid over-stressing the sites we are scraping
[05:30] (and avoid the all-knowing cloudflare being angry with us)
[05:31] I think 20 is more than enough then, as it's only 50k records.
[05:31] probably, yeah
[05:32] and once it's done (in a couple of hours, probably) I'll turn it off, and re-check it again in a year or so, to grab the new ones added since then
[05:32] Well, in a year it will likely be 1-5 characters (1 more)
[05:33] probably, but that's fine too -- the API doesn't limit the length (at least until we get into really long ones like 8 or 9 characters IIRC)
[05:34] Yeah, I read that the script cannot handle that. Wonder why.
[05:35] because it treats shortcodes as numbers, and once they get that big, they overflow
[05:35] yes, it's a silly restriction -- but like so many things, we haven't gotten around to fixing it (yet)
[05:35] Luckily Ruby has BigNum (I bet if it's in Python it has an equivalent)
[05:36] yeah, that would be the fix, I think
[05:36] one trick is it would need to work *in* the Warrior, which has a constrained environment
[05:37] How do you specify in the API to use the Location header? Or do you just leave the body regex blank?
[05:38] you can select whether to make a HEAD or GET request; if you make a HEAD request, it looks for the Location header, otherwise it uses the body regex
[05:38] BigNums are scalable and while they have a lot of overhead, unless it's a production server or data crunching, it's nothing to worry about.
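The "treats shortcodes as numbers" point above can be illustrated with a minimal sketch. This is not the tracker's actual code; the function names `int_to_shortcode` and `shortcode_to_int` are hypothetical, and the alphabet is the 62-character one quoted in the project JSON in this log. Python integers are arbitrary precision, so the arithmetic itself does not overflow here; where the real overflow happens is not stated in the log.

```python
# Hypothetical sketch: map between sequence numbers and shortcodes.
# Not taken from terroroftinytown; names and behaviour are assumptions.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def int_to_shortcode(num, alphabet=ALPHABET):
    """Write a non-negative integer in base len(alphabet)."""
    if num < 0:
        raise ValueError("sequence numbers must be non-negative")
    base = len(alphabet)
    if num == 0:
        return alphabet[0]
    digits = []
    while num:
        num, rem = divmod(num, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def shortcode_to_int(code, alphabet=ALPHABET):
    """Inverse of int_to_shortcode for codes without leading alphabet[0]."""
    base = len(alphabet)
    value = 0
    for ch in code:
        value = value * base + alphabet.index(ch)
    return value


# A 10-character code is still an ordinary Python int.
print(int_to_shortcode(62 ** 9))  # 1000000000
```

Note that a plain positional mapping like this cannot represent codes with leading `0` characters as distinct values, which is one reason a real implementation may differ.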
[05:39] eh, we can discuss BigNums once there's code to look at
[05:39] JesseW: Check this please if I got it right ;)
[05:39] {"autorelease_time":1800,"lower_sequence_num":0,"location_anti_regex":"","url_template":"http://alturl.com/{shortcode}","min_client_version":7,"name":"nsfw-in","max_num_items":1,"min_version":45,"enabled":true,"method":"head","alphabet":"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ","unavailable_codes":[],"request_delay":0.5,"auto
[05:39] queue":false,"redirect_codes":[302],"num_count_per_item":50,"no_redirect_codes":[200],"banned_codes":[403,420,429]}
[05:40] you have to change url_template, too
[05:40] No idea how you specify using only 5-char-long ones, though.
[05:41] I did change the template, forgot to change the name, sorry, it's the shorturl.com, which does use alturl links
[05:41] you don't; as I said, it just starts from lower_sequence_num, and keeps going till you tell it to stop
[05:41] Oh, then maybe it should have that feature.
[05:41] probably worth adding, yeah
[05:41] Except for the wrong name, did I write it right? Also the auto-queue?
[05:42] but it's less common that something will stop right at the end of a range like that -- more often (for incremental ones) you want to just go on till it runs out
[05:42] and it's not *harmful* to keep trying non-existing ones; it's just wasteful
[05:43] I think the rest is right, yeah
[05:43] but it's not so useful, as those files aren't how new items are added. :-)
[05:43] there's a GUI, which is what I'm using
[05:44] What exactly does the autoqueue do, and will it run one batch, or do you have to queue the first one?
[05:44] Oh. Ok.
[05:44] autoqueue will keep generating new jobs
[05:44] with increasingly higher initial sequence numbers
[05:44] Is the gui somewhere and should I look into it?
[05:45] you can see the source code for the GUI in the tracker/admin section of terroroftinytown
[05:45] Great.
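The autoqueue behaviour described above (new jobs with ever higher starting sequence numbers, beginning at `lower_sequence_num` and running until someone stops it) might look roughly like the sketch below. The generator name `autoqueue_items` is hypothetical, and the reading of `num_count_per_item` as "sequence numbers covered by one job" is an assumption, not something the log confirms.

```python
import itertools


def autoqueue_items(lower_sequence_num, num_count_per_item):
    """Yield (start, end) half-open sequence ranges, one per job, forever.

    Assumption: each job covers num_count_per_item consecutive
    sequence numbers. There is no upper bound, matching the
    "keeps going till you tell it to stop" description in the log.
    """
    start = lower_sequence_num
    while True:
        yield (start, start + num_count_per_item)
        start += num_count_per_item


# First three jobs for the values in the pasted JSON
# (lower_sequence_num=0, num_count_per_item=50).
first_jobs = list(itertools.islice(autoqueue_items(0, 50), 3))
print(first_jobs)  # [(0, 50), (50, 100), (100, 150)]
```

Because the generator never terminates on its own, a caller has to bound it, for example with `itertools.islice`, which mirrors the operator turning the project off by hand.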
[05:45] and if you want to play with it, you can run a local copy
[05:46] but the live version is only accessible to a few of us, so we can add and maintain the projects
[05:46] so does tiny.pl really not include capital letters?
[05:47] Yeah, I meant to see how the GUI works. I am not sure, but I (not always) also checked google: intext:http://tiny.pl/
[05:48] hm, looks like that, yeah
[05:48] Oh, bummer, so I did not do it right anyway, missed that.
[05:48] (The json, I mean)
[05:48] missed what?
[05:49] Removing capital letters from the alphabet.
[05:49] ah, yeah
[05:51] ok, tiny-pl started
[05:52] and results seen; boosted queue to 30
[05:53] Either it takes time to propagate or there are really enough people working on the project.
[05:54] there are a couple of REALLY BIG firehoses working on the project
[05:54] and they do grab up most of the jobs
[05:54] :-/
[05:54] it's a nice problem to have, though
[05:54] well, it's not about how big they are because of those shorteners, the more jobs the better.
[05:54] yep
[05:55] Not about BW. I got more jobs from other shorteners though. So it helped.
[05:55] :-)
[05:57] I am going to break my mental block and, a little, my budget, and get a machine for experimentation with unlimited transfer. Google Code seems to need it and I have other uses. Having docker really helps to do a lot of stuff on a single machine. ... Thanks for the job queue work.
[05:57] certainly
[05:57] thanks for the investigations. Feel free to do more!
[05:58] Feel free to add more to the queue xD
[05:59] of ones to investigate, or projects? I'll get the projects added pretty soon; as for the ones to investigate -- I think there are still literally hundreds listed, so I think that's pretty well stocked for now.
[06:00] The projects to the tracker. So far there is no shortage of shorteners to investigate, I think I can find more :D
[06:00] good
[06:02] well, I'm heading to sleep soon. G'night!
[06:14] Night!
[06:15] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:22] *** WinterFox has joined #urlteam
[06:22] *** WinterFox has quit IRC (Client Quit)
[06:23] *** WinterFox has joined #urlteam
[07:51] *** bwn has quit IRC (Read error: Operation timed out)
[08:16] *** Muad-Dib has quit IRC (Quit: ZNC - http://znc.in)
[08:19] *** bwn has joined #urlteam
[11:32] *** Deewiant_ is now known as Deewiant
[13:08] *** WinterFox has quit IRC (Remote host closed the connection)
[13:38] *** twrist has joined #urlteam
[13:38] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[13:38] *** twrist is now known as GLaDOS
[14:19] *** Start has quit IRC (Quit: Disconnected.)
[14:54] *** Start has joined #urlteam
[16:06] *** Start has quit IRC (Quit: Disconnected.)
[18:01] *** bwn has quit IRC (Ping timeout: 246 seconds)
[18:33] *** bwn has joined #urlteam
[18:41] *** BnA-Rob1n has quit IRC (Remote host closed the connection)
[19:19] *** BnA-Rob1n has joined #urlteam
[19:33] *** Start has joined #urlteam
[19:43] *** Start has quit IRC (Quit: Disconnected.)
[19:49] *** Muad-Dib has joined #urlteam
[19:49] *** jornane has quit IRC (Ping timeout: 244 seconds)
[19:51] *** jornane has joined #urlteam
[22:06] *** Start has joined #urlteam
[23:10] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)