#urlteam 2016-04-13,Wed

Time Nickname Message
00:02 🔗 JW_work http://fpaste.org/354840/05052241/ <- contribution from Yoshimura, will be added to wiki later
00:19 🔗 JesseW has joined #urlteam
01:50 🔗 Yoshimura Pushed changes to wiki, S* T* shorteners classification and info, the changes as txt file: http://fpaste.org/354848/05120341/
03:00 🔗 bwn has quit IRC (Ping timeout: 492 seconds)
03:13 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
03:18 🔗 Yoshimura has joined #urlteam
03:40 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
03:41 🔗 Yoshimura has joined #urlteam
04:27 🔗 bwn_ has joined #urlteam
04:33 🔗 bwn_ is now known as bwn
05:18 🔗 JesseW nsfw-in project started (thanks Yoshimura for the investigation!)
05:20 🔗 Yoshimura Yaaay!, I wanted to cheer about more data to get, but the amount of links at that is not that great :D (50k?) Thanks for info and shout ^^
05:21 🔗 JesseW you're welcome -- I'm still tweaking the regex to pick it out of the body, though
05:21 🔗 JesseW (oops, I had it set to HEAD, not GET) -- that may have been the problem
05:23 🔗 Yoshimura Hah. You should pick some easy-to-do, larger shortener, as there are not enough tasks to work on in the URL project :)
05:23 🔗 JesseW of the ones you added, which one seemed largest (and straightforward)?
05:24 🔗 Yoshimura Just by looking at the textfile, shorturl.com seems simplest. tiny.pl also larger, but needs one condition
05:25 🔗 JesseW it doesn't look like you added tiny.pl to the wiki yet?
05:25 🔗 JesseW oh, now I see it
05:26 🔗 JesseW ok, I'll do tiny.pl next, then shorturl.com
05:26 🔗 Yoshimura Shorturl is larger though and simpler. Well, larger by the possible combinations.
05:27 🔗 JesseW eh, I'll get them both
05:27 🔗 Yoshimura I did not express myself well before and caused confusion, apologies.
05:27 🔗 JesseW :-)
05:28 🔗 JesseW ah, results have come in for nsfw-in
05:28 🔗 JesseW ok, boosting the nsfw-in queue to 20
05:29 🔗 JesseW I'll increase it further in a while, if it continues to be OK
05:29 🔗 * Yoshimura still has to understand the queue limits and stuff in the api
05:30 🔗 JesseW The queue limit is the number of jobs that can be simultaneously worked on at once.
05:30 🔗 JesseW It's a way to avoid over-stressing the sites we are scraping
05:30 🔗 JesseW (and avoid the all-knowing cloudflare being angry with us)
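
[Editor's sketch] The queue limit described above can be pictured as a cap on how many jobs are checked out at once. This is a minimal illustration only, not the tracker's actual code; the class `LimitedQueue` and its method names are hypothetical.

```python
from collections import deque


class LimitedQueue:
    """Hands out jobs, but never more than `limit` at the same time."""

    def __init__(self, jobs, limit):
        self.pending = deque(jobs)   # jobs nobody has picked up yet
        self.limit = limit           # the "queue limit" from the discussion
        self.in_progress = set()     # jobs currently being worked on

    def checkout(self):
        """Return the next job, or None if the limit is reached or nothing is left."""
        if len(self.in_progress) >= self.limit or not self.pending:
            return None
        job = self.pending.popleft()
        self.in_progress.add(job)
        return job

    def checkin(self, job):
        """Mark a job as finished, freeing a slot for another worker."""
        self.in_progress.discard(job)
```

With a limit of 20, the 21st worker asking for a job gets nothing until someone else finishes, which is what keeps the load on the scraped site bounded.
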
05:31 🔗 Yoshimura I think then 20 is more than enough, as it's only 50k records.
05:31 🔗 JesseW probably, yeah
05:32 🔗 JesseW and once it's done (in a couple of hours, probably) I'll turn it off, and re-check it again in a year or so, to grab the new ones added since then
05:32 🔗 Yoshimura Well, in a year it will likely be 1-5 characters (1more)
05:33 🔗 JesseW probably, but that's fine too -- the API doesn't limit the length (at least until we get into really long ones like 8 or 9 characters IIRC)
05:34 🔗 Yoshimura Yeah, I read that script cannot handle that. Wonder why.
05:35 🔗 JesseW because it treats shortcodes as numbers, and once they get that big, they overflow
05:35 🔗 JesseW yes, it's a silly restriction -- but like so many things, we haven't gotten around to fixing it (yet)
05:35 🔗 Yoshimura Luckily, Ruby has BigNum (and I bet if it's in Python, that has an equivalent)
05:36 🔗 JesseW yeah, that would be the fix, I think
05:36 🔗 JesseW one trick is it would need to work *in* the Warrior, which has a constrained environment
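
[Editor's sketch] To illustrate "treating shortcodes as numbers": a shortcode can be read as a number written in base len(alphabet). The mapping below is a plain positional encoding and is an assumption; the tracker's real mapping may differ. Python's built-in `int` is arbitrary precision, so this particular sketch does not overflow for long codes. The function names are hypothetical; the alphabet is the one from the settings pasted later in this log.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def int_to_shortcode(num, alphabet=ALPHABET):
    """Write a non-negative integer in base len(alphabet)."""
    if num < 0:
        raise ValueError("sequence numbers must be non-negative")
    if num == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while num:
        num, rem = divmod(num, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def shortcode_to_int(code, alphabet=ALPHABET):
    """Inverse of int_to_shortcode (assumed positional encoding)."""
    base = len(alphabet)
    num = 0
    for ch in code:
        num = num * base + alphabet.index(ch)
    return num
```

A 12-character code over this 62-letter alphabet already exceeds 2**64, which is the kind of size where a fixed-width integer would stop working.
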
05:37 🔗 Yoshimura How do you specify in API to use Location header? Or you just leave body regex blank?
05:38 🔗 JesseW you can select whether to make a HEAD or GET request; if you make a HEAD request, it looks for the Location header, otherwise it uses the body regex
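
[Editor's sketch] The HEAD/GET rule just described, as a pure function with no network access. It assumes the response headers arrive as a dict keyed by the exact name "Location" and that the body regex has one capture group holding the target URL; `extract_target` and its parameters are hypothetical names, not the project's API.

```python
import re


def extract_target(method, headers, body, body_regex=None):
    """Pick the long URL out of a shortener response.

    HEAD: use the Location header. GET: search the body with the regex.
    Returns None when nothing is found.
    """
    if method.lower() == "head":
        return headers.get("Location")
    if not body_regex:
        return None
    match = re.search(body_regex, body)
    return match.group(1) if match else None
```
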
05:38 🔗 Yoshimura BigNums are scalable, and while they have a lot of overhead, unless it's a production server or data crunching, it's nothing to worry about.
05:39 🔗 JesseW eh, we can discuss BigNums once there's a code to look at
05:39 🔗 Yoshimura JesseW: Check this please if I got it right ;)
05:39 🔗 Yoshimura {"autorelease_time":1800,"lower_sequence_num":0,"location_anti_regex":"","url_template":"http://alturl.com/{shortcode}","min_client_version":7,"name":"nsfw-in","max_num_items":1,"min_version":45,"enabled":true,"method":"head","alphabet":"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ","unavailable_codes":[],"request_delay":0.5,"autoqueue":false,"redirect_codes":[302],"num_count_per_item":50,"no_redirect_codes":[200],"banned_codes":[403,420,429]}
05:40 🔗 JesseW you have to change url_template, too
05:40 🔗 Yoshimura No idea how you specify to use only 5 char long though.
05:41 🔗 Yoshimura I did change the template, forgot to change the name, sorry; it's shorturl.com, which does use alturl links
05:41 🔗 JesseW you don't; as I said, it just starts from lower_sequence_num, and keeps going till you tell it to stop
05:41 🔗 Yoshimura Oh, then maybe it should have that feature.
05:41 🔗 JesseW probably worth adding, yeah
05:41 🔗 Yoshimura Except for the wrong name, did I write it right? Also the auto-queue?
05:42 🔗 JesseW but it's less common that something will stop right at the end of a range like that -- more often (for incremental ones) you want to just go on till it runs out
05:42 🔗 JesseW and it's not *harmful* to keep trying non-existing ones; it's just wasteful
05:43 🔗 JesseW I think the rest is right, yeah
05:43 🔗 JesseW but it's not so useful, as those files aren't how new items are added. :-)
05:43 🔗 JesseW there's a GUI, which is what I'm using
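
[Editor's sketch] One way to read a few of the fields from the settings pasted at 05:39. The field names and values come from that paste; the interpretation of the three status-code lists is inferred from their names and is an assumption, as are the helper functions `build_url` and `classify_status`.

```python
import json

# A subset of the settings pasted in the log.
CONFIG_JSON = """
{"url_template": "http://alturl.com/{shortcode}",
 "method": "head",
 "redirect_codes": [302],
 "no_redirect_codes": [200],
 "banned_codes": [403, 420, 429]}
"""

config = json.loads(CONFIG_JSON)


def build_url(config, shortcode):
    """Fill the {shortcode} placeholder in url_template."""
    return config["url_template"].format(shortcode=shortcode)


def classify_status(config, status):
    """Sort an HTTP status code into one of the configured buckets."""
    if status in config["redirect_codes"]:
        return "redirect"       # assumed: the shortcode resolves somewhere
    if status in config["no_redirect_codes"]:
        return "no_redirect"    # assumed: the shortcode is not in use
    if status in config["banned_codes"]:
        return "banned"         # assumed: the site is blocking or throttling us
    return "unexpected"
```
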
05:44 🔗 Yoshimura What does the autoqueue exactly do, and will it run one batch, or you have to queue the first one?
05:44 🔗 Yoshimura Oh. Ok.
05:44 🔗 JesseW autoqueue will keep generating new jobs
05:44 🔗 JesseW with increasingly higher initial sequence numbers
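
[Editor's sketch] Autoqueue as described here, pictured as an endless generator of jobs whose starting sequence numbers keep rising. `lower_sequence_num` and `num_count_per_item` are field names from the pasted settings; treating each job as an inclusive (first, last) range is an assumption, and `generate_items` is a hypothetical name.

```python
from itertools import islice


def generate_items(lower_sequence_num, num_count_per_item):
    """Yield (first, last) sequence-number ranges forever, each one
    starting where the previous one ended."""
    start = lower_sequence_num
    while True:
        yield (start, start + num_count_per_item - 1)
        start += num_count_per_item


# First three jobs for the values in the paste (0 and 50).
first_jobs = list(islice(generate_items(0, 50), 3))
```

Because the generator never ends on its own, someone has to switch the project off once the results show the shortcodes have run out, which matches the "keeps going till you tell it to stop" behaviour mentioned earlier.
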
05:44 🔗 Yoshimura Is the gui somewhere and should I look into it?
05:45 🔗 JesseW you can see the source code for the GUI in the tracker/admin section of terroroftinytown
05:45 🔗 Yoshimura Great.
05:45 🔗 JesseW and if you want to play with it, you can run a local copy
05:46 🔗 JesseW but the live version is only accessible to a few of us, so we can add and maintain the projects
05:46 🔗 JesseW so does tiny.pl really not include capital letters?
05:47 🔗 Yoshimura Yeah, I meant to see how the GUI works. I am not sure, but I (not always) also checked google: intext:http://tiny.pl/
05:48 🔗 JesseW hm, looks like that, yeah
05:48 🔗 Yoshimura Oh, bummer, so I did not do it right anyway, missed that.
05:48 🔗 Yoshimura (The json, I mean)
05:48 🔗 JesseW missed what?
05:49 🔗 Yoshimura Removing capital letter from alphabet.
05:49 🔗 JesseW ah, yeah
05:51 🔗 JesseW ok, tiny-pl started
05:52 🔗 JesseW and results seen; boosted queue to 30
05:53 🔗 Yoshimura Either it takes time to propagate or there are really enough people working on the project.
05:54 🔗 JesseW there are a couple of REALLY BIG firehoses working on the project
05:54 🔗 JesseW and they do grab up most of the jobs
05:54 🔗 JesseW :-/
05:54 🔗 JesseW it's a nice problem to have, though
05:54 🔗 Yoshimura well, it's not about how big they are because of those shorteners, the more jobs the better.
05:54 🔗 JesseW yep
05:55 🔗 Yoshimura Not about BW. I got more jobs from other shorteners though. So it helped.
05:55 🔗 JesseW :-)
05:57 🔗 Yoshimura I am going to break my mind block and a little my budget and get machine for experimentation with unlimited transfer. Google Code seems to need it and I have other uses. Having docker really helps to do a lot of stuff on single machine. ... Thanks for the job queue work.
05:57 🔗 JesseW certainly
05:57 🔗 JesseW thanks for the investigations. Feel free to do more!
05:58 🔗 Yoshimura Feel free to add more to the queue xD
05:59 🔗 JesseW of ones to investigate, or projects? I'll get the projects added pretty soon; as for the ones to investigate -- I think there are still literally hundreds listed, so I think that's pretty well stocked for now.
06:00 🔗 Yoshimura The projects to tracker. So far there is no shortage of shorteners to investigate, I think I can find more :D
06:00 🔗 JesseW good
06:02 🔗 JesseW well, I'm heading to sleep soon. G'night!
06:14 🔗 Yoshimura Night!
06:15 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:22 🔗 WinterFox has joined #urlteam
06:22 🔗 WinterFox has quit IRC (Client Quit)
06:23 🔗 WinterFox has joined #urlteam
07:51 🔗 bwn has quit IRC (Read error: Operation timed out)
08:16 🔗 Muad-Dib has quit IRC (Quit: ZNC - http://znc.in)
08:19 🔗 bwn has joined #urlteam
11:32 🔗 Deewiant_ is now known as Deewiant
13:08 🔗 WinterFox has quit IRC (Remote host closed the connection)
13:38 🔗 twrist has joined #urlteam
13:38 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
13:38 🔗 twrist is now known as GLaDOS
14:19 🔗 Start has quit IRC (Quit: Disconnected.)
14:54 🔗 Start has joined #urlteam
16:06 🔗 Start has quit IRC (Quit: Disconnected.)
18:01 🔗 bwn has quit IRC (Ping timeout: 246 seconds)
18:33 🔗 bwn has joined #urlteam
18:41 🔗 BnA-Rob1n has quit IRC (Remote host closed the connection)
19:19 🔗 BnA-Rob1n has joined #urlteam
19:33 🔗 Start has joined #urlteam
19:43 🔗 Start has quit IRC (Quit: Disconnected.)
19:49 🔗 Muad-Dib has joined #urlteam
19:49 🔗 jornane has quit IRC (Ping timeout: 244 seconds)
19:51 🔗 jornane has joined #urlteam
22:06 🔗 Start has joined #urlteam
23:10 🔗 Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)