[01:19] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[01:22] *** tech234a has joined #urlteam
[03:18] *** richid has joined #urlteam
[03:21] Hey folks, can I make a request to stop (or slow) a crawl?
[03:21] You certainly can
[03:21] What crawl would that be, richid?
[03:22] Who are you representing?
[03:22] In particular, https://like2b.uy
[03:22] Certainly I can
[03:22] It looks like it, but it's not a URL shortener
[03:22] I honestly forgot I turned that back on. It was supposed to stay off
[03:22] And if you take a look at the logs, it's all just 302s to the same URL
[03:23] I will turn off auto queue and let the current jobs finish
[03:23] Looks like it already stopped :)
[03:23] I think that one came from an Instagram search; there are a few that weren't redirects to the same URL, but most of them are
[03:24] Yeah, the items were coming really fast
[03:24] Do you have any recommendations for URL shorteners to start crawling, richid?
[03:25] I did look into it again; it looks like they are redirects to storefronts through Instagram, but most of them are longer than the short codes we are scanning
[03:25] sorry about that
[03:25] But if you have any suggestions to add to the tracker or turn back on, just let me know
[03:26] No worries, thanks for being so responsive
[03:26] All good, I was watching Hulu when I saw the message pop up on my client
[03:27] After looking at your tracker, you've got way more shorteners in there than I even know about
[03:27] So as I said, if there are any shorteners you think should be added to the tracker or turned back on, just give me a ping and I will see what I can do
[03:27] Ahahaha yeah, we have a fair few. I mainly stick to the simple ones, I haven't worked out coding just yet. But if you have any suggestions, just send them my way
[03:30] Will do! And thanks for doing this work, I'm a big fan of archiving projects
[03:30] My absolute pleasure. I am the same: I came to Archive Team only a year or 2 ago and fell in love with the URLTeam project
[03:31] I always came across heaps in my journeys to the weird scummy parts of the net and was glad someone came to preserve them
[03:32] Thanks for being understanding, richid!
[03:33] Flashfire: re a.gd -- always better to add more info
[03:33] Somebody2: do I clean up and delete the existing info, or just tack the new info onto the end?
[03:36] Also, for something like "1click.im - DNS not responding as of 15:48, 15 May 2016 (EDT)", is it OK if I update it to not responding as of 2019, or is that a stupid change to make, Somebody2?
[03:37] *** richid has quit IRC (Quit: Page closed)
[03:37] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:43] I am going to go ahead and do it; you can revert my changes later if you prefer
[03:44] *** odemg has joined #urlteam
[04:12] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[04:25] Does anyone know if we are starting the goo.gl scrape soon?
[04:26] I might start it up on the very smallest queue when this current export finishes
[04:28] Does anyone know if goo.gl uses _
[04:31] *** warmwaffl has quit IRC (Remote host closed the connection)
[04:41] Flashfire: I am trying to find that out for you.
[04:47] I tried to find URLs with `_`, but I did not find any with a quick search. I also tried to experiment using URLs with the underscore, but they did not redirect and only gave 404s. I do not think underscores were used.
[04:48] I hope that helps.
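(A minimal sketch of the underscore experiment described at [04:47], not the script actually used; it assumes the Python `requests` library and made-up test codes. At the time, a 404 from goo.gl meant an unused code.)

```python
# Hypothetical re-creation of the [04:47] underscore test: request a
# few goo.gl codes containing "_" and see whether any of them redirect.
import requests

def probe(code):
    # allow_redirects=False so we see the shortener's own response:
    # 301/302 means a live redirect, 404 means the code is unused.
    resp = requests.get('https://goo.gl/' + code, allow_redirects=False)
    return resp.status_code, resp.headers.get('Location')

for code in ('a_b4', 'x_Y9z', 'ab_cde'):  # made-up codes with underscores
    print(code, probe(code))
```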
[05:04] 2018-12-25 06:53:37 Fusl FYI if this wasn't known yet, goo.gl urls currently are 4-6 chars A-Za-z0-9, 301 is redirect, 404 is not found, 403 is banned
[05:06] No 3-character codes at all, Fusl?
[05:09] Hmmm, does goo.gl only work for https?
[05:11] *** tech234a has joined #urlteam
[05:14] two-three chars are reserved by Google
[05:15] But that means that they still exist, maybe
[05:15] 2018-12-25 06:55:30 Fusl 1-2 char urls are reserved for google internal use, 3 char urls don't seem to be used at all
[05:15] 2018-12-25 07:02:27 Fusl 200 is for deleted URLs
[05:15] 2018-12-25 07:11:38 Fusl here are my bets for how many URLs there are 301ing: 4: 11.60M, 5: 765.89M, 6: 7838.43M
[05:15] 2018-12-25 12:28:49 @JAA Fusl: Thanks. Do you know whether the codes are incremental or random?
[05:15] 2018-12-25 12:31:03 Fusl they're random
[05:15] 2018-12-25 12:31:22 Fusl at least from my tests
[05:15] I mean, if they are used, then maybe I could start them from 0 and have it super slow
[05:20] Fusl, do you think that 3 batches of 10 URLs at a time will trigger rate limiting?
[05:21] I have the project almost ready
[05:21] Flashfire: I can do a thousand requests within a short time without even getting a single 403
[05:22] Fusl, aren't you rotating IPs though?
[05:22] Also, I want to play it safe, damn it
[05:22] tested on my laptop, routed over a single ISP
[05:22] I will set it up the same way I did Puri.na until I am confident
[05:23] also
[05:23] 302 https://goo.gl/v6BJWm https://www.google.com/sorry/index?continue=https://goo.gl/v6BJWm&q=EgRR2W9pGLCT4OUFIhkA8aeDS3sL7I_rQ10HPgBD6zBENIpCeTnsMgFy
[05:23] banned ^
[05:23] http://xor.meo.ws/Y7outygiuCXUlhYFXHGQDkCQh84uL6Yb.txt
[05:24] When is that from?
[05:24] I will add 302 to the banned list
[05:24] just now
[05:24] no
[05:24] 302 is a redirect
[05:24] but
[05:24] Setting it up same as Puri.na, 10 URLs in batches of 5?
[05:24] 302 https://goo.gl/buJZZI http://www.pinkrod.com/videos/13912/horny-michelle-lay-having-fun-with-a-young-guy/
[05:24] this one is good
[05:25] but a 302 to https://www.google.com/sorry/index* is not
[05:25] OK, so I will make the queue and URL batches low and see how I go
[05:25] 5 batches of 10 running at once should be fine
[05:26] If you don't have any objections, Fusl, I will start it
[05:26] I will keep an eye on it for as long as I can
[05:27] Flashfire: are we able to detect a 302 to https://www.google.com/sorry/index* as banned?
[05:27] That's my biggest concern
[05:27] "Location header reject regular expression:"
[05:27] that might be helpful
[05:27] I don't know exactly how all of the tracker works
[05:28] I assumed if we kept the queue low enough, then we wouldn't run into it
[05:28] Keep an eye on it as best we can
[05:28] That's not really a good way to ensure quality of the grabs
[05:29] I know, but it's the best I have unless I just leave the settings as is and wait for someone more experienced
[05:30] "Python regex to match non-redirect values in the Location header. Use to reject things like links to the shortener's homepage."
[05:31] Well, if you want to log into the tracker, you do have access
[05:31] yeah, I'm looking at that right now
[05:35] custom code?
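(For context on Fusl's [05:15] bets: with a 62-character alphabet A-Za-z0-9, the keyspace per code length and the implied live-URL density work out as in this quick back-of-the-envelope sketch.)

```python
# Back-of-the-envelope check of Fusl's [05:15] estimates: the keyspace
# for n-character codes over A-Za-z0-9 is 62**n.
for n, est_millions in [(4, 11.60), (5, 765.89), (6, 7838.43)]:
    keyspace = 62 ** n
    density = est_millions * 1e6 / keyspace
    print(f'{n} chars: {keyspace:,} possible codes, ~{density:.0%} live')
# 4 chars: 14,776,336 possible codes, ~79% live
# 5 chars: 916,132,832 possible codes, ~84% live
# 6 chars: 56,800,235,584 possible codes, ~14% live
```

(The distinction drawn at [05:23]-[05:25] — a 302 only counts as a real result if its Location header is not Google's rate-limit page — could be expressed roughly as below. This is a hedged sketch with made-up names, not the tracker's actual mechanism; the working fix was Fusl's custom script linked at [06:14] below.)

```python
import re

# Assumption: matching the Location header against Google's
# "sorry/index" interstitial is enough to tell a ban from a
# genuine redirect.
BAN_LOCATION = re.compile(r'^https?://www\.google\.com/sorry/index')

def classify(status, location=None):
    """Map a goo.gl response to the categories from the log above."""
    if status == 302 and BAN_LOCATION.match(location or ''):
        return 'banned'     # rate-limited: the [05:23] case
    if status in (301, 302):
        return 'redirect'   # a live short URL: the [05:24] case
    if status == 404:
        return 'not found'  # unused code
    if status == 200:
        return 'deleted'    # per Fusl: 200 is a deleted URL
    if status == 403:
        return 'banned'
    return 'unknown'

assert classify(302, 'https://www.google.com/sorry/index?continue=...') == 'banned'
assert classify(302, 'http://www.pinkrod.com/videos/13912/...') == 'redirect'
assert classify(404) == 'not found'
```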
[05:36] The Location header regex doesn't seem to do what we want
[05:37] Clients that run into the 302 banned redirect will just keep consuming items from the queue
[05:37] well shit
[05:37] well, I cannot write custom code, I can't code
[05:37] I'll take a look
[05:54] added digbig
[05:59] k, I think I got the custom script to capture that rate limit redirect, just gonna do some tests
[05:59] ok, feel free
[05:59] I am just cleaning up error reports; I will avoid the goo.gl ones
[06:14] https://github.com/Fusl/terroroftinytown/commit/cbffc8c80319eebaf2155c4d44ebb26f3f139bcb this should do the trick, still figuring out how to do the tests tho
[07:20] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[07:40] https://1n.pm/
[07:41] https://loo.lt/
[07:42] http://vrg.sk/
[07:51] https://github.com/ArchiveTeam/terroroftinytown/pull/67
[08:01] Hey guys, um, Zapt.In is listed as dead on the wiki but its short URLs are resolving
[08:01] chfoo: first time adding a service custom script to urlteam, can you take a look at this and lmk if I need anything else changed?
[08:03] Wait, hold on, never mind, they are set to redirect through the Wayback Machine
[08:29] *** Zerote has joined #urlteam
[09:02] The documentation for using the Docker image with environment variables at https://hub.docker.com/r/archiveteam/warrior-dockerfile/ appears to have an "-e" that makes the command fail
[09:04] Looks like it was fixed 17 days ago on GitHub :/
[09:07] FUSL
[09:08] Flashfire: ye?
[09:08] above
[09:08] You made the Dockerfile thing, didn't you?
[09:08] nope
[09:09] I wish I had access to the ArchiveTeam org so that I could set up autobuilds
[09:10] I can run multiple warriors on different IPs, right?
[09:10] I've got too many underused VPSes hanging around
[09:10] yea
[09:20] *** mtntmnky has quit IRC (Remote host closed the connection)
[09:20] *** mtntmnky has joined #urlteam
[09:46] Flashfire: Still looking for some shortener suggestions? I recently added a small handful of YOURLS-based ones to the wiki
[10:04] Any way to check stats for my own nick on the tracker?
[10:06] Kagee: You can select the row limit on the tracker, set it to 5000 or so, and Ctrl+F your nick. It's pretty heavy to render and update that many rows though, so your browser might be a bit sluggish
[10:10] I wrote a small Python program that gets the data from the websocket and displays some more interesting info like scans/hr, but the code is a bit too sloppy to post on GitHub right now
[11:13] So regarding goo.gl, I was hoping we could get URLTeam to write WARCs before starting that one. Although I guess we could still do that later and regrab them if necessary. The WARC thing probably won't happen anytime soon anyway.
[11:20] Yeah, WARCing would be so great for URLTeam
[11:20] https://github.com/ArchiveTeam/terroroftinytown/issues/1
[11:20] issue #1
[11:23] The thing is: how do we synchronize the WARCs over to a master server? Are we gonna do it in a similar fashion to how the other tracker projects work, by rsyncing them, or do we push them to the tracker server?
[11:27] I don't think we'll want to put even more work on the tracker, so something like the other projects is probably best.
[12:49] Zerote: made a very ugly hack while working on mobile: wscat --connect "wss://tracker.archiveteam.org:1338/api/live_stats" 2>/dev/null | grep -m 1 Kagee | sed 's/{/\n{/' | tail -1 | jq '.lifetime["Kagee"] | "Found: \(.[0]), scanned: \(.[1])"' → "Found: 38194, scanned: 215085"
[14:05] *** Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
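(The [12:49] wscat one-liner can be written as a small Python script along the lines of what [10:10] describes. A sketch assuming the third-party `websockets` library; the JSON layout, `.lifetime[nick]` = `[found, scanned]`, is inferred from the jq expression above and not otherwise confirmed.)

```python
# Minimal take on the [12:49] hack: read frames from the tracker's
# live_stats websocket until one names the nick, then print its
# lifetime stats.
import asyncio
import json
import websockets  # third-party: pip install websockets

async def lifetime_stats(nick):
    uri = 'wss://tracker.archiveteam.org:1338/api/live_stats'
    async with websockets.connect(uri) as ws:
        while True:  # like `grep -m 1`: wait for a frame naming the nick
            data = json.loads(await ws.recv())
            if nick in data.get('lifetime', {}):
                break
    found, scanned = data['lifetime'][nick][:2]
    print(f'Found: {found}, scanned: {scanned}')

asyncio.run(lifetime_stats('Kagee'))
```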
[14:49] *** seatsea has joined #urlteam
[18:10] *** ave_ has joined #urlteam
[20:30] *** ave_ has quit IRC (Quit: Connection closed for inactivity)
[21:46] *** Soni has joined #urlteam