[01:19] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[01:22] *** tech234a has joined #urlteam
[03:18] *** richid has joined #urlteam
[03:21] Hey folks, can I make a request to stop (or slow) a crawl?
[03:21] You certainly can
[03:21] What crawl would that be, richid?
[03:22] Who are you representing?
[03:22] In particular, https://like2b.uy
[03:22] Certainly I can
[03:22] It looks like it, but it's not a URL shortener
[03:22] I honestly forgot I turned that back on. It was supposed to stay off
[03:22] And if you take a look at the logs, it's all just 302s to the same URL
[03:23] I will turn off auto queue and let the current jobs finish
[03:23] Looks like it already stopped :)
[03:23] I think that one came from an Instagram search; there are a few that weren't redirects to the same URL, but most of them are
[03:24] Yeah, the items were coming really fast
[03:24] Do you have any recommendations for URL shorteners to start crawling, richid?
[03:25] I did look into it again; it looks like they are redirects to storefronts through Instagram, but most of them are longer than the short codes we are scanning
[03:25] sorry about that
[03:25] But if you have any suggestions to add to the tracker or turn back on, just let me know
[03:26] No worries, thanks for being so responsive
[03:26] All good, I was watching Hulu when I saw the message pop up on my client
[03:27] After looking at your tracker, you've got way more shorteners in there than I even know about
[03:27] So as I said, if there are any shorteners you think should be added to the tracker or turned back on, just give me a ping and I will see what I can do
[03:27] Ahahaha yeah, we have a fair few. I mainly stick to the simple ones, I haven't worked out coding just yet. But if you have any suggestions, just send them my way
[03:30] Will do! And thanks for doing this work, I'm a big fan of archiving projects
[03:30] My absolute pleasure. I am the same: I came to Archive Team only a year or 2 ago and fell in love with the URLTeam project
[03:31] I always came across heaps in my journeys to the weird scummy parts of the net and was glad someone came to preserve them
[03:32] Thanks for being understanding, richid!
[03:33] Flashfire: re a.gd -- always better to add more info
[03:33] Somebody2: do I clean up and delete the existing info, or just tack the new info onto the end?
[03:36] Also, for something like "1click.im - DNS not responding as of 15:48, 15 May 2016 (EDT)", is it OK if I update it to not responding as of 2019, or is that a stupid change to make, Somebody2?
[03:37] *** richid has quit IRC (Quit: Page closed)
[03:37] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:43] I am going to go ahead and do it; you can revert my changes later if you prefer
[03:44] *** odemg has joined #urlteam
[04:12] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[04:25] Does anyone know if we are starting the goo.gl scrape soon?
[04:26] I might start it up on the very smallest queue when this current export finishes
[04:28] Does anyone know if goo.gl uses _
[04:31] *** warmwaffl has quit IRC (Remote host closed the connection)
[04:41] Flashfire: I am trying to find that out for you.
[04:47] I tried to find URLs with `_`, but I did not find any with a quick search. I also tried to experiment using URLs with the underscore, but they did not redirect and only gave 404s. I do not think underscores were used.
[04:48] I hope that helps.
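(A minimal sketch of the underscore experiment described at [04:47], not the script actually used; it assumes the Python `requests` library and made-up test codes. At the time, a 404 from goo.gl meant an unused code.)

```python
# Hypothetical re-creation of the [04:47] underscore test: request a
# few goo.gl codes containing "_" and see whether any of them redirect.
import requests

def probe(code):
    # allow_redirects=False so we see the shortener's own response:
    # 301/302 means a live redirect, 404 means the code is unused.
    resp = requests.get('https://goo.gl/' + code, allow_redirects=False)
    return resp.status_code, resp.headers.get('Location')

for code in ('a_b4', 'x_Y9z', 'ab_cde'):  # made-up codes with underscores
    print(code, probe(code))
```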
[05:04] 2018-12-25 06:53:37 Fusl FYI if this wasn't known yet, goo.gl urls currently are 4-6 chars A-Za-z0-9, 301 is redirect, 404 is not found, 403 is banned
[05:06] No 3-character codes at all, Fusl?
[05:09] Hmmm, does goo.gl only work for https?
[05:11] *** tech234a has joined #urlteam
[05:14] two-three chars are reserved by Google
[05:15] But that means that they still exist, maybe
[05:15] 2018-12-25 06:55:30 Fusl 1-2 char urls are reserved for google internal use, 3 char urls don't seem to be used at all
[05:15] 2018-12-25 07:02:27 Fusl 200 is for deleted URLs
[05:15] 2018-12-25 07:11:38 Fusl here are my bets for how many URLs there are 301ing: 4: 11.60M, 5: 765.89M, 6: 7838.43M
[05:15] 2018-12-25 12:28:49 @JAA Fusl: Thanks. Do you know whether the codes are incremental or random?
[05:15] 2018-12-25 12:31:03 Fusl they're random
[05:15] 2018-12-25 12:31:22 Fusl at least from my tests
[05:15] I mean, if they are used, then maybe I could start them from 0 and have it super slow
[05:20] Fusl, do you think that 3 batches of 10 URLs at a time will trigger rate limiting?
[05:21] I have the project almost ready
[05:21] Flashfire: I can do a thousand requests within a short time without even getting a single 403
[05:22] Fusl, aren't you rotating IPs though?
[05:22] Also, I want to play it safe, damn it
[05:22] tested on my laptop, routed over a single ISP
[05:22] I will set it up the same way I did Puri.na until I am confident
[05:23] also
[05:23] 302 https://goo.gl/v6BJWm https://www.google.com/sorry/index?continue=https://goo.gl/v6BJWm&q=EgRR2W9pGLCT4OUFIhkA8aeDS3sL7I_rQ10HPgBD6zBENIpCeTnsMgFy
[05:23] banned ^
[05:23] http://xor.meo.ws/Y7outygiuCXUlhYFXHGQDkCQh84uL6Yb.txt
[05:24] When is that from?
[05:24] I will add 302 to the banned list
[05:24] just now
[05:24] no
[05:24] 302 is a redirect
[05:24] but
[05:24] Setting it up same as Puri.na, 10 URLs in batches of 5?
[05:24] 302 https://goo.gl/buJZZI http://www.pinkrod.com/videos/13912/horny-michelle-lay-having-fun-with-a-young-guy/
[05:24] this one is good
[05:25] but a 302 to https://www.google.com/sorry/index* is not
[05:25] OK, so I will make the queue and URL batches low and see how I go
[05:25] 5 batches of 10 running at once should be fine
[05:26] If you don't have any objections, Fusl, I will start it
[05:26] I will keep an eye on it for as long as I can
[05:27] Flashfire: are we able to detect a 302 to https://www.google.com/sorry/index* as banned?
[05:27] That's my biggest concern
[05:27] "Location header reject regular expression:"
[05:27] that might be helpful
[05:27] I don't know exactly how all of the tracker works
[05:28] I assumed if we kept the queue low enough, then we wouldn't run into it
[05:28] Keep an eye on it as best we can
[05:28] That's not really a good way to ensure quality of the grabs
[05:29] I know, but it's the best I have unless I just leave the settings as is and wait for someone more experienced
[05:30] "Python regex to match non-redirect values in the Location header. Use to reject things like links to the shortener's homepage."
[05:31] Well, if you want to log into the tracker, you do have access
[05:31] yeah, I'm looking at that right now
[05:35] custom code?
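(For context on Fusl's [05:15] bets: with a 62-character alphabet A-Za-z0-9, the keyspace per code length and the implied live-URL density work out as in this quick back-of-the-envelope sketch.)

```python
# Back-of-the-envelope check of Fusl's [05:15] estimates: the keyspace
# for n-character codes over A-Za-z0-9 is 62**n.
for n, est_millions in [(4, 11.60), (5, 765.89), (6, 7838.43)]:
    keyspace = 62 ** n
    density = est_millions * 1e6 / keyspace
    print(f'{n} chars: {keyspace:,} possible codes, ~{density:.0%} live')
# 4 chars: 14,776,336 possible codes, ~79% live
# 5 chars: 916,132,832 possible codes, ~84% live
# 6 chars: 56,800,235,584 possible codes, ~14% live
```

(The distinction drawn at [05:23]-[05:25] — a 302 only counts as a real result if its Location header is not Google's rate-limit page — could be expressed roughly as below. This is a hedged sketch with made-up names, not the tracker's actual mechanism; the working fix was Fusl's custom script linked at [06:14] below.)

```python
import re

# Assumption: matching the Location header against Google's
# "sorry/index" interstitial is enough to tell a ban from a
# genuine redirect.
BAN_LOCATION = re.compile(r'^https?://www\.google\.com/sorry/index')

def classify(status, location=None):
    """Map a goo.gl response to the categories from the log above."""
    if status == 302 and BAN_LOCATION.match(location or ''):
        return 'banned'     # rate-limited: the [05:23] case
    if status in (301, 302):
        return 'redirect'   # a live short URL: the [05:24] case
    if status == 404:
        return 'not found'  # unused code
    if status == 200:
        return 'deleted'    # per Fusl: 200 is a deleted URL
    if status == 403:
        return 'banned'
    return 'unknown'

assert classify(302, 'https://www.google.com/sorry/index?continue=...') == 'banned'
assert classify(302, 'http://www.pinkrod.com/videos/13912/...') == 'redirect'
assert classify(404) == 'not found'
```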
[05:36] The Location header regex doesn't seem to do what we want
[05:37] Clients that run into the 302 banned redirect will just keep consuming items from the queue
[05:37] well shit
[05:37] well, I cannot write custom code, I can't code
[05:37] I'll take a look
[05:54] added digbig
[05:59] k, I think I got the custom script to capture that rate limit redirect, just gonna do some tests
[05:59] ok, feel free
[05:59] I am just cleaning up error reports; I will avoid the goo.gl ones
[06:14] https://github.com/Fusl/terroroftinytown/commit/cbffc8c80319eebaf2155c4d44ebb26f3f139bcb this should do the trick, still figuring out how to do the tests tho
[07:20] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[07:40] https://1n.pm/
[07:41] https://loo.lt/
[07:42] http://vrg.sk/
[07:51] https://github.com/ArchiveTeam/terroroftinytown/pull/67
[08:01] Hey guys, um, Zapt.In is listed as dead on the wiki but its short URLs are resolving
[08:01] chfoo: first time adding a service custom script to urlteam, can you take a look at this and lmk if I need anything else changed?
[08:03] Wait, hold on, never mind, they are set to redirect through the Wayback Machine
[08:29] *** Zerote has joined #urlteam
[09:02] The documentation for using the Docker image with environment variables at https://hub.docker.com/r/archiveteam/warrior-dockerfile/ appears to have an "-e" that makes the command fail
[09:04] Looks like it was fixed 17 days ago on GitHub :/
[09:07] FUSL
[09:08] Flashfire: ye?
[09:08] above
[09:08] You made the Dockerfile thing, didn't you?
[09:08] nope
[09:09] I wish I had access to the ArchiveTeam org so that I could set up autobuilds
[09:10] I can run multiple warriors on different IPs, right?
[09:10] I've got too many underused VPSes hanging around
[09:10] yea
[09:20] *** mtntmnky has quit IRC (Remote host closed the connection)
[09:20] *** mtntmnky has joined #urlteam
[09:46] Flashfire: Still looking for some shortener suggestions? I recently added a small handful of YOURLS-based ones to the wiki
[10:04] Any way to check stats for my own nick on the tracker?
[10:06] Kagee: You can select the row limit on the tracker, set it to 5000 or so, and Ctrl+F your nick. It's pretty heavy to render and update that many rows though, so your browser might be a bit sluggish
[10:10] I wrote a small Python program that gets the data from the websocket and displays some more interesting info like scans/hr, but the code is a bit too sloppy to post on GitHub right now
[11:13] So regarding goo.gl, I was hoping we could get URLTeam to write WARCs before starting that one. Although I guess we could still do that later and regrab them if necessary. The WARC thing probably won't happen anytime soon anyway.
[11:20] Yeah, WARCing would be so great for URLTeam
[11:20] https://github.com/ArchiveTeam/terroroftinytown/issues/1
[11:20] issue #1
[11:23] The thing is: how do we synchronize the WARCs over to a master server? Are we gonna do it in a similar fashion to how the other tracker projects work, by rsyncing them, or do we push them to the tracker server?
[11:27] I don't think we'll want to put even more work on the tracker, so something like the other projects is probably best.
[12:49] Zerote: made a very ugly hack while working on mobile: wscat --connect "wss://tracker.archiveteam.org:1338/api/live_stats" 2>/dev/null | grep -m 1 Kagee | sed 's/{/\n{/' | tail -1 | jq '.lifetime["Kagee"] | "Found: \(.[0]), scanned: \(.[1])"' → "Found: 38194, scanned: 215085"
[14:05] *** Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
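(The [12:49] wscat one-liner can be written as a small Python script along the lines of what [10:10] describes. A sketch assuming the third-party `websockets` library; the JSON layout, `.lifetime[nick]` = `[found, scanned]`, is inferred from the jq expression above and not otherwise confirmed.)

```python
# Minimal take on the [12:49] hack: read frames from the tracker's
# live_stats websocket until one names the nick, then print its
# lifetime stats.
import asyncio
import json
import websockets  # third-party: pip install websockets

async def lifetime_stats(nick):
    uri = 'wss://tracker.archiveteam.org:1338/api/live_stats'
    async with websockets.connect(uri) as ws:
        while True:  # like `grep -m 1`: wait for a frame naming the nick
            data = json.loads(await ws.recv())
            if nick in data.get('lifetime', {}):
                break
    found, scanned = data['lifetime'][nick][:2]
    print(f'Found: {found}, scanned: {scanned}')

asyncio.run(lifetime_stats('Kagee'))
```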
[14:49] *** seatsea has joined #urlteam
[18:10] *** ave_ has joined #urlteam
[20:30] *** ave_ has quit IRC (Quit: Connection closed for inactivity)
[21:46] *** Soni has joined #urlteam