Time |
Nickname |
Message |
00:02
🔗
|
JW_work |
http://fpaste.org/354840/05052241/ <- contribution from Yoshimura, will be added to wiki later |
00:19
🔗
|
|
JesseW has joined #urlteam |
01:50
🔗
|
Yoshimura |
Pushed changes to wiki, S* T* shorteners classification and info, the changes as txt file: http://fpaste.org/354848/05120341/ |
03:00
🔗
|
|
bwn has quit IRC (Ping timeout: 492 seconds) |
03:13
🔗
|
|
Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
03:18
🔗
|
|
Yoshimura has joined #urlteam |
03:40
🔗
|
|
Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |
03:41
🔗
|
|
Yoshimura has joined #urlteam |
04:27
🔗
|
|
bwn_ has joined #urlteam |
04:33
🔗
|
|
bwn_ is now known as bwn |
05:18
🔗
|
JesseW |
nsfw-in project started (thanks Yoshimura for the investigation!) |
05:20
🔗
|
Yoshimura |
Yaaay!, I wanted to cheer about more data to get, but the amount of links at that is not that great :D (50k?) Thanks for info and shout ^^ |
05:21
🔗
|
JesseW |
you're welcome -- I'm still tweaking the regex to pick it out of the body, though |
05:21
🔗
|
JesseW |
(oops, I had it set to HEAD, not GET) -- that may have been the problem |
05:23
🔗
|
Yoshimura |
Hah. You should pick some easy-to-do, larger shortener, as there are not enough tasks to work on in the URL project :) |
05:23
🔗
|
JesseW |
of the ones you added, which one seemed largest (and straightforward)? |
05:24
🔗
|
Yoshimura |
Just by looking at the textfile, shorturl.com seems simplest. tiny.pl also larger, but needs one condition |
05:25
🔗
|
JesseW |
it doesn't look like you added tiny.pl to the wiki yet? |
05:25
🔗
|
JesseW |
oh, now I see it |
05:26
🔗
|
JesseW |
ok, I'll do tiny.pl next, then shorturl.com |
05:26
🔗
|
Yoshimura |
Shorturl is larger though and simpler. Well, larger by the possible combinations. |
05:27
🔗
|
JesseW |
eh, I'll get them both |
05:27
🔗
|
Yoshimura |
I did not express myself well before and caused confusion, apologies. |
05:27
🔗
|
JesseW |
:-) |
05:28
🔗
|
JesseW |
ah, results have come in for nsfw-in |
05:28
🔗
|
JesseW |
ok, boosting the nsfw-in queue to 20 |
05:29
🔗
|
JesseW |
I'll increase it further in a while, if it continues to be OK |
05:29
🔗
|
* |
Yoshimura still has to understand the queue limits and stuff in the api |
05:30
🔗
|
JesseW |
The queue limit is the number of jobs that can be worked on at once. |
05:30
🔗
|
JesseW |
It's a way to avoid over-stressing the sites we are scraping |
05:30
🔗
|
JesseW |
(and avoid the all-knowing cloudflare being angry with us) |
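
The queue limit described above can be sketched as a cap on in-flight jobs. This is only an illustration: the `JobQueue` class and its method names are hypothetical and are not taken from the tracker's code.

```python
class JobQueue:
    """Hypothetical tracker-side queue that caps the number of in-flight jobs."""

    def __init__(self, queue_limit):
        self.queue_limit = queue_limit  # e.g. 20 for nsfw-in in this log
        self.in_flight = set()          # job ids handed out but not yet finished
        self.next_job_id = 0

    def checkout(self):
        """Hand out a new job id, or None while the limit is reached."""
        if len(self.in_flight) >= self.queue_limit:
            return None
        job_id = self.next_job_id
        self.next_job_id += 1
        self.in_flight.add(job_id)
        return job_id

    def complete(self, job_id):
        """Mark a job as finished, which frees one slot."""
        self.in_flight.discard(job_id)
```

Raising the limit lets more workers hit the target site at the same time, which is why it is boosted only after the first results look OK.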
05:31
🔗
|
Yoshimura |
I think then 20 is more than enough, as it's only 50k records. |
05:31
🔗
|
JesseW |
probably, yeah |
05:32
🔗
|
JesseW |
and once it's done (in a couple of hours, probably) I'll turn it off, and re-check it again in a year or so, to grab the new ones added since then |
05:32
🔗
|
Yoshimura |
Well, in a year it will likely be 1-5 characters (1more) |
05:33
🔗
|
JesseW |
probably, but that's fine too -- the API doesn't limit the length (at least until we get into really long ones like 8 or 9 characters IIRC) |
05:34
🔗
|
Yoshimura |
Yeah, I read that the script cannot handle that. Wonder why. |
05:35
🔗
|
JesseW |
because it treats shortcodes as numbers, and once they get that big, they overflow |
05:35
🔗
|
JesseW |
yes, it's a silly restriction -- but like so many things, we haven't gotten around to fixing it (yet) |
05:35
🔗
|
Yoshimura |
Luckily, Ruby has BigNum (I bet if it's in Python, it has an equivalent) |
05:36
🔗
|
JesseW |
yeah, that would be the fix, I think |
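
A minimal sketch of treating shortcodes as numbers, assuming a plain positional (base-N) mapping over the alphabet. The function names are hypothetical, the real mapping may differ, and the log does not say which integer width actually overflows, so the 64-bit check below is only an illustration. Python's built-in `int` is arbitrary precision, so it does not overflow by itself.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def int_to_shortcode(num, alphabet=ALPHABET):
    """Convert a non-negative sequence number to a shortcode (assumed base-N mapping)."""
    if num < 0:
        raise ValueError("sequence numbers must be non-negative")
    base = len(alphabet)
    if num == 0:
        return alphabet[0]
    chars = []
    while num > 0:
        num, rem = divmod(num, base)
        chars.append(alphabet[rem])
    return "".join(reversed(chars))


def shortcode_to_int(code, alphabet=ALPHABET):
    """Inverse of int_to_shortcode for codes without leading zero-digits."""
    base = len(alphabet)
    num = 0
    for ch in code:
        num = num * base + alphabet.index(ch)
    return num


def fits_in_signed(num, bits=64):
    """True if num is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= num < 2 ** (bits - 1)
```

With a 62-character alphabet, the largest 10-character code still fits in a signed 64-bit integer, while the largest 11-character code does not; narrower number types run out sooner.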
05:36
🔗
|
JesseW |
one trick is it would need to work *in* the Warrior, which has a constrained environment |
05:37
🔗
|
Yoshimura |
How do you specify in API to use Location header? Or you just leave body regex blank? |
05:38
🔗
|
JesseW |
you can select whether to make a HEAD or GET request; if you make a HEAD request, it looks for the Location header, otherwise it uses the body regex |
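
The HEAD/GET rule above could look roughly like this. It is a sketch under assumptions: `extract_target` and its parameters are hypothetical names, the response is passed in as plain values rather than fetched, and the regex is assumed to capture the target URL in its first group.

```python
import re


def extract_target(method, status, headers, body, body_regex=None,
                   redirect_codes=(302,)):
    """Return the long URL for one shortcode response, or None if not found.

    HEAD: read the Location header of a redirect response.
    GET:  apply the body regex and return its first capture group.
    """
    if method.lower() == "head":
        if status in redirect_codes:
            return headers.get("Location")
        return None
    if body_regex is None:
        return None
    match = re.search(body_regex, body)
    return match.group(1) if match else None
```

This also shows why a project set to HEAD by mistake finds nothing when the shortener only exposes the target in the page body.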
05:38
🔗
|
Yoshimura |
BigNums are scalable and while they have a lot of overhead, unless it's a production server or data crunching, nothing to worry about. |
05:39
🔗
|
JesseW |
eh, we can discuss BigNums once there's code to look at |
05:39
🔗
|
Yoshimura |
JesseW: Check this please if I got it right ;) |
05:39
🔗
|
Yoshimura |
{"autorelease_time":1800,"lower_sequence_num":0,"location_anti_regex":"","url_template":"http://alturl.com/{shortcode}","min_client_version":7,"name":"nsfw-in","max_num_items":1,"min_version":45,"enabled":true,"method":"head","alphabet":"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ","unavailable_codes":[],"request_delay":0.5,"auto |
05:39
🔗
|
Yoshimura |
queue":false,"redirect_codes":[302],"num_count_per_item":50,"no_redirect_codes":[200],"banned_codes":[403,420,429]} |
05:40
🔗
|
JesseW |
you have to change url_template, too |
05:40
🔗
|
Yoshimura |
No idea how you specify to use only 5 char long though. |
05:41
🔗
|
Yoshimura |
I did change the template, forgot to change the name, sorry; it's shorturl.com, which does use alturl links |
05:41
🔗
|
JesseW |
you don't; as I said, it just starts from lower_sequence_num, and keeps going till you tell it to stop |
05:41
🔗
|
Yoshimura |
Oh, then maybe it should have that feature. |
05:41
🔗
|
JesseW |
probably worth adding, yeah |
05:41
🔗
|
Yoshimura |
Except for the wrong name, did I write it right? Also the auto-queue? |
05:42
🔗
|
JesseW |
but it's less common that something will stop right at the end of a range like that -- more often (for incremental ones) you want to just go on till it runs out |
05:42
🔗
|
JesseW |
and it's not *harmful* to keep trying non-existing ones; it's just wasteful |
05:43
🔗
|
JesseW |
I think the rest is right, yeah |
05:43
🔗
|
JesseW |
but it's not so useful, as those files aren't how new items are added. :-) |
05:43
🔗
|
JesseW |
there's a GUI, which is what I'm using |
05:44
🔗
|
Yoshimura |
What does the autoqueue exactly do, and will it run one batch, or you have to queue the first one? |
05:44
🔗
|
Yoshimura |
Oh. Ok. |
05:44
🔗
|
JesseW |
autoqueue will keep generating new jobs |
05:44
🔗
|
JesseW |
with increasingly higher initial sequence numbers |
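
The autoqueue behaviour can be sketched as an endless generator of job ranges. The parameter names are borrowed from the JSON pasted earlier, but their exact meaning in the tracker is an assumption here, and the `autoqueue` function itself is hypothetical.

```python
from itertools import islice


def autoqueue(lower_sequence_num, num_count_per_item):
    """Yield (first, last) sequence-number ranges forever, each starting higher."""
    start = lower_sequence_num
    while True:
        yield (start, start + num_count_per_item - 1)
        start += num_count_per_item


# First three jobs for a project starting at 0 with 50 codes per item.
first_jobs = list(islice(autoqueue(0, 50), 3))
```

Because the generator never ends, someone has to disable the project once the results stop turning up existing shortcodes.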
05:44
🔗
|
Yoshimura |
Is the gui somewhere and should I look into it? |
05:45
🔗
|
JesseW |
you can see the source code for the GUI in the tracker/admin section of terroroftinytown |
05:45
🔗
|
Yoshimura |
Great. |
05:45
🔗
|
JesseW |
and if you want to play with it, you can run a local copy |
05:46
🔗
|
JesseW |
but the live version is only accessible to a few of us, so we can add and maintain the projects |
05:46
🔗
|
JesseW |
so does tiny.pl really not include capital letters? |
05:47
🔗
|
Yoshimura |
Yeah, I meant to see how the GUI works. I am not sure, but I (not always) also checked google: intext:http://tiny.pl/ |
05:48
🔗
|
JesseW |
hm, looks like that, yeah |
05:48
🔗
|
Yoshimura |
Oh, bummer, so I did not do it right anyway, missed that. |
05:48
🔗
|
Yoshimura |
(The json, I mean) |
05:48
🔗
|
JesseW |
missed what? |
05:49
🔗
|
Yoshimura |
Removing capital letters from the alphabet. |
05:49
🔗
|
JesseW |
ah, yeah |
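
For illustration, dropping the capitals from the alphabet in the pasted JSON could be done as below. That tiny.pl uses no capitals is only what the search in this log suggested, and the variable names are hypothetical.

```python
import string

# Same 62-character alphabet as in the pasted JSON.
FULL_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

# Digits and lowercase only, as assumed for tiny.pl.
TINY_PL_ALPHABET = "".join(ch for ch in FULL_ALPHABET if not ch.isupper())

# Number of possible codes of exactly 5 characters under each alphabet.
full_keyspace = len(FULL_ALPHABET) ** 5
tiny_pl_keyspace = len(TINY_PL_ALPHABET) ** 5
```

Leaving the capitals in would not break anything, but it would spend requests on codes that presumably do not exist.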
05:51
🔗
|
JesseW |
ok, tiny-pl started |
05:52
🔗
|
JesseW |
and results seen; boosted queue to 30 |
05:53
🔗
|
Yoshimura |
Either it takes time to propagate or there are really enough people working on the project. |
05:54
🔗
|
JesseW |
there are a couple of REALLY BIG firehoses working on the project |
05:54
🔗
|
JesseW |
and they do grab up most of the jobs |
05:54
🔗
|
JesseW |
:-/ |
05:54
🔗
|
JesseW |
it's a nice problem to have, though |
05:54
🔗
|
Yoshimura |
well, it's not about how big they are because of those shorteners, the more jobs the better. |
05:54
🔗
|
JesseW |
yep |
05:55
🔗
|
Yoshimura |
Not about BW. I got more jobs from other shorteners though. So it helped. |
05:55
🔗
|
JesseW |
:-) |
05:57
🔗
|
Yoshimura |
I am going to break my mind block and a little my budget and get a machine for experimentation with unlimited transfer. Google Code seems to need it and I have other uses. Having docker really helps to do a lot of stuff on a single machine. ... Thanks for the job queue work. |
05:57
🔗
|
JesseW |
certainly |
05:57
🔗
|
JesseW |
thanks for the investigations. Feel free to do more! |
05:58
🔗
|
Yoshimura |
Feel free to add more to the queue xD |
05:59
🔗
|
JesseW |
of ones to investigate, or projects? I'll get the projects added pretty soon; as for the ones to investigate -- I think there are still literally hundreds listed, so I think that's pretty well stocked for now. |
06:00
🔗
|
Yoshimura |
The projects to tracker. So far there is no shortage of shorteners to investigate, I think I can find more :D |
06:00
🔗
|
JesseW |
good |
06:02
🔗
|
JesseW |
well, I'm heading to sleep soon. G'night! |
06:14
🔗
|
Yoshimura |
Night! |
06:15
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
06:22
🔗
|
|
WinterFox has joined #urlteam |
06:22
🔗
|
|
WinterFox has quit IRC (Client Quit) |
06:23
🔗
|
|
WinterFox has joined #urlteam |
07:51
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
08:16
🔗
|
|
Muad-Dib has quit IRC (Quit: ZNC - http://znc.in) |
08:19
🔗
|
|
bwn has joined #urlteam |
11:32
🔗
|
|
Deewiant_ is now known as Deewiant |
13:08
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
13:38
🔗
|
|
twrist has joined #urlteam |
13:38
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
13:38
🔗
|
|
twrist is now known as GLaDOS |
14:19
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
14:54
🔗
|
|
Start has joined #urlteam |
16:06
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
18:01
🔗
|
|
bwn has quit IRC (Ping timeout: 246 seconds) |
18:33
🔗
|
|
bwn has joined #urlteam |
18:41
🔗
|
|
BnA-Rob1n has quit IRC (Remote host closed the connection) |
19:19
🔗
|
|
BnA-Rob1n has joined #urlteam |
19:33
🔗
|
|
Start has joined #urlteam |
19:43
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
19:49
🔗
|
|
Muad-Dib has joined #urlteam |
19:49
🔗
|
|
jornane has quit IRC (Ping timeout: 244 seconds) |
19:51
🔗
|
|
jornane has joined #urlteam |
22:06
🔗
|
|
Start has joined #urlteam |
23:10
🔗
|
|
Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client) |