#urlteam 2019-04-18,Thu


Time Nickname Message
01:19 🔗 Zerote has quit IRC (Ping timeout: 260 seconds)
01:22 🔗 tech234a has joined #urlteam
03:18 🔗 richid has joined #urlteam
03:21 🔗 richid Hey folks, can I make a request to stop (or slow) a crawl?
03:21 🔗 Flashfire You certainly can
03:21 🔗 Flashfire What crawl would that be richid?
03:22 🔗 Flashfire Who are you representing?
03:22 🔗 richid In particular, https://like2b.uy
03:22 🔗 Flashfire Certainly I can
03:22 🔗 richid It looks like it, but it's not a URL shortener
03:22 🔗 Flashfire I honestly forgot i turned that back on. It was supposed to stay off
03:22 🔗 richid And if you take a look at the logs, it's all just 302s to the same URL
03:23 🔗 Flashfire I will turn off auto queue and let the current jobs finish
03:23 🔗 richid Looks like it already stopped :)
03:23 🔗 Flashfire I think that one came from an instagram search there are a few that werent redirects to the same url but most of them are
03:24 🔗 Flashfire Yeah the items were coming really fast
03:24 🔗 Flashfire Do you have any recommendations for url shorteners to start crawling richid?
03:25 🔗 Flashfire I did look into it again it looks like they are redirects to store fronts through instagram but most of them are longer than the short codes we are scanning
03:25 🔗 Flashfire sorry about that
03:25 🔗 Flashfire But if you have any suggestions to add or turn back onto the tracker just let me know
03:26 🔗 richid No worries, thanks for being so responsive
03:26 🔗 Flashfire All Good I was watching Hulu when I saw the message pop up on my client
03:27 🔗 richid After looking at your tracker, you've got way more shorteners in there than I even know about
03:27 🔗 Flashfire So as I said if there are any shorteners you think should be added to the tracker or turned back on just give me a ping and I will see what I can do
03:27 🔗 Flashfire Ahahaha yeah we have a fair few. I mainly stick to the simple ones I havent worked out coding just yet. But if you have any suggestions just send them my way
03:30 🔗 richid Will do! and thanks for doing this work, I'm a big fan of archiving projects
03:30 🔗 Flashfire my absolute pleasure. I am the same I came to archive team only a year or 2 ago and fell in love with the URLteam project
03:31 🔗 Flashfire I always came across heaps in my journeys to the weird scummy parts of the net and was glad someone came to preserve them
03:32 🔗 Somebody2 Thanks for being understanding, richid!
03:33 🔗 Somebody2 Flashfire: re a.gd -- always better to add more info
03:33 🔗 Flashfire Somebody2 do I clean up and delete the existing info or just tack the new info onto the end
03:36 🔗 Flashfire also for something like 1click.im - DNS not responding as of 15:48, 15 May 2016 (EDT) is it ok if I update it to not responding as of 2019 or is that a stupid change to make Somebody2
03:37 🔗 richid has quit IRC (Quit: Page closed)
03:37 🔗 odemg has quit IRC (Ping timeout: 615 seconds)
03:43 🔗 Flashfire I am going to go ahead and do it you can revert my changes later if you prefer
03:44 🔗 odemg has joined #urlteam
04:12 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
04:25 🔗 Flashfire Does anyone know if we are starting the goo.gl scrape soon?
04:26 🔗 Flashfire I might start it up on the very smallest queue when this current export finishes
04:28 🔗 Flashfire Does anyone know if goo.gl uses _
04:31 🔗 warmwaffl has quit IRC (Remote host closed the connection)
04:41 🔗 t3 Flashfire: I am trying to find that out for you.
04:47 🔗 t3 I tried to find URLs with `_`, but I did not find any with a quick search. I also tried to experiment using URLs with the underscore, but they did not redirect and only gave 404s. I do not think underscores were used.
04:48 🔗 t3 I hope that helps.
05:04 🔗 Fusl 2018-12-25 06:53:37 Fusl FYI if this wasn't known yet, goo.gl urls currently are 4-6 chars A-Za-z0-9, 301 is redirect, 404 is not found, 403 is banned
05:06 🔗 Flashfire no 3 characters at all Fusl?
05:09 🔗 Flashfire Hmmm does goo.gl only work for https?
05:11 🔗 tech234a has joined #urlteam
05:14 🔗 Fusl two-three chars are reserved by google
05:15 🔗 Flashfire But that means that they still exist maybe
05:15 🔗 Fusl 2018-12-25 06:55:30 Fusl 1-2 char urls are reserved for google internal use, 3 char urls don't seem to be used at all
05:15 🔗 Fusl 2018-12-25 07:02:27 Fusl 200 is for deleted URLs
05:15 🔗 Fusl 2018-12-25 07:11:38 Fusl here are my bets for how many URLs there are 301ing: 4: 11.60M, 5: 765.89M, 6: 7838.43M
05:15 🔗 Fusl 2018-12-25 12:28:49 @JAA Fusl: Thanks. Do you know whether the codes are incremental or random?
05:15 🔗 Fusl 2018-12-25 12:31:03 Fusl they're random
05:15 🔗 Fusl 2018-12-25 12:31:22 Fusl at least from my tests
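[Editor's note] To put Fusl's estimates in context, the sketch below computes the size of the code space for 4 to 6 character codes over A-Za-z0-9 and the share of it that the quoted estimates would cover. The per-length figures are Fusl's bets from the log, not measurements, and the function names are illustrative only.

```python
import string

# A-Za-z0-9 as described in the log: 62 symbols.
ALPHABET = string.ascii_letters + string.digits

# Fusl's estimates of codes that return a 301, per code length, in millions.
ESTIMATED_LIVE_MILLIONS = {4: 11.60, 5: 765.89, 6: 7838.43}


def keyspace(length: int) -> int:
    """Number of possible codes of the given length."""
    return len(ALPHABET) ** length


def fill_ratio(length: int) -> float:
    """Estimated share of the code space that redirects, per the log's figures."""
    return ESTIMATED_LIVE_MILLIONS[length] * 1_000_000 / keyspace(length)


if __name__ == "__main__":
    for n in sorted(ESTIMATED_LIVE_MILLIONS):
        print(f"{n} chars: {keyspace(n):,} possible codes, "
              f"about {fill_ratio(n):.1%} estimated live")
```

If the estimates hold, the 4 and 5 character ranges would be mostly populated while the 6 character range would be comparatively sparse, which matters for how many requests a full scan needs per hit.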
05:15 🔗 Flashfire I mean if they are used then maybe I could start them from 0 and have it super slow
05:20 🔗 Flashfire Fusl do you think that 3 batches of 10 urls at a time will trigger rate limiting?
05:21 🔗 Flashfire I have the project almost ready
05:21 🔗 Fusl Flashfire: i can do a thousand requests within a short time without even getting a single 403
05:22 🔗 Flashfire Fusl arent you rotating IPs though?
05:22 🔗 Flashfire Also I want to play it safe damn it
05:22 🔗 Fusl tested on my laptop routed over a single ISP
05:22 🔗 Flashfire I will set it up the same way I did Puri.na until I am confident
05:23 🔗 Fusl also
05:23 🔗 Fusl 302 https://goo.gl/v6BJWm https://www.google.com/sorry/index?continue=https://goo.gl/v6BJWm&q=EgRR2W9pGLCT4OUFIhkA8aeDS3sL7I_rQ10HPgBD6zBENIpCeTnsMgFy
05:23 🔗 Fusl banned ^
05:23 🔗 Fusl http://xor.meo.ws/Y7outygiuCXUlhYFXHGQDkCQh84uL6Yb.txt
05:24 🔗 Flashfire when is that from?
05:24 🔗 Flashfire I will add 302 to the banned list
05:24 🔗 Fusl just now
05:24 🔗 Fusl no
05:24 🔗 Fusl 302 is a redirect
05:24 🔗 Fusl but
05:24 🔗 Flashfire Setting it up same as Puri.na 10 urls in batches of 5?
05:24 🔗 Fusl 302 https://goo.gl/buJZZI http://www.pinkrod.com/videos/13912/horny-michelle-lay-having-fun-with-a-young-guy/
05:24 🔗 Fusl this one is good
05:25 🔗 Fusl but a 302 to https://www.google.com/sorry/index* is not
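[Editor's note] The status codes discussed so far can be summarised in a small classifier. This is a minimal sketch based only on what Fusl reports in this log (301 or 302 as redirect, 404 not found, 403 banned, 200 deleted, and a 302 to the Google "sorry" page as banned). The function name and return labels are hypothetical and are not taken from the tracker code.

```python
from typing import Optional

# Prefix of the rate-limit page that Fusl was redirected to in the log.
SORRY_PREFIX = "https://www.google.com/sorry/index"


def classify(status: int, location: Optional[str] = None) -> str:
    """Map a goo.gl response to a label, following the observations in the log."""
    if status in (301, 302):
        if location and location.startswith(SORRY_PREFIX):
            return "banned"      # redirect to the rate-limit page
        if location:
            return "redirect"    # a real short URL target
        return "unknown"         # redirect status without a Location header
    if status == 404:
        return "not_found"
    if status == 403:
        return "banned"
    if status == 200:
        return "deleted"
    return "unknown"
```

The key point is that the status code alone is not enough: a 302 has to be checked against its Location header before it is recorded as a valid redirect.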
05:25 🔗 Flashfire Ok so I will make the queue and url batches low and see how I go
05:25 🔗 Flashfire 5 batches of 10 running at once should be fine
05:26 🔗 Flashfire If you dont have any objections fusl I will start it
05:26 🔗 Flashfire I will keep an eye on it for as long as I can
05:27 🔗 Fusl Flashfire: are we able to detect a 302 to https://www.google.com/sorry/index* as banned?
05:27 🔗 Fusl thats my biggest concern
05:27 🔗 Flashfire Location header reject regular expression:
05:27 🔗 Flashfire that might be helpful
05:27 🔗 Flashfire I dont know exactly how all of the tracker works
05:28 🔗 Flashfire I assumed if we kept the queue low enough then we wouldnt run into it
05:28 🔗 Flashfire Keep an eye on it as best we can
05:28 🔗 Fusl thats not really a good way to ensure quality of the grabs
05:29 🔗 Flashfire I know but its the best I have unless I just leave the settings as is and wait for someone more experienced
05:30 🔗 Fusl Python regex to match non-redirect values in the Location header. Use to reject things like links to the shortener's homepage.
05:31 🔗 Flashfire Well if you want to log into the tracker you do have access
05:31 🔗 Fusl yeah im looking at that right now
05:35 🔗 Fusl custom code?
05:36 🔗 Fusl location header regex doesnt seem to do what we want
05:37 🔗 Fusl clients that run into the 302 banned redirect will just keep consuming items from the queue
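[Editor's note] The concern here is that rejecting a Location value is not the same as stopping the client. The sketch below shows the general idea of halting a scan on the first rate-limit response instead of continuing through the queue. It is an illustration under assumptions: `scan_until_banned`, `is_banned`, and the `fetch` callable are hypothetical names, and this is not the custom script that was later written for the tracker.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

# Pattern for the rate-limit redirect target quoted in the log.
BANNED_LOCATION = re.compile(r"^https://www\.google\.com/sorry/index")

# (status code, Location header or None)
Response = Tuple[int, Optional[str]]


def is_banned(status: int, location: Optional[str]) -> bool:
    """True for a 403 or for any redirect to the rate-limit page."""
    return status == 403 or bool(location and BANNED_LOCATION.match(location))


def scan_until_banned(
    codes: Iterable[str],
    fetch: Callable[[str], Response],
) -> Tuple[List[Tuple[str, int, Optional[str]]], Optional[str]]:
    """Fetch codes in order and stop at the first banned response.

    Returns the results gathered so far and the code that triggered the
    stop (None if the whole batch completed).
    """
    results = []
    for code in codes:
        status, location = fetch(code)
        if is_banned(status, location):
            return results, code  # stop: do not consume further items
        results.append((code, status, location))
    return results, None
```

Returning the code that triggered the stop lets the caller requeue it and back off, rather than recording a bogus result for it.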
05:37 🔗 Flashfire well shit
05:37 🔗 Flashfire well I can not write custom code I cant code
05:37 🔗 Fusl ill take a look
05:54 🔗 Flashfire added digbig
05:59 🔗 Fusl k i think i got the custom script to capture that rate limit redirect, just gonna do some tests
05:59 🔗 Flashfire ok feel free
05:59 🔗 Flashfire I am just cleaning up error reports I will avoid the googl ones
06:14 🔗 Fusl_ https://github.com/Fusl/terroroftinytown/commit/cbffc8c80319eebaf2155c4d44ebb26f3f139bcb this should do the trick, still figuring out how to do the tests tho
07:20 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
07:40 🔗 Flashfire https://1n.pm/
07:41 🔗 Flashfire https://loo.lt/
07:42 🔗 Flashfire http://vrg.sk/
07:51 🔗 Fusl https://github.com/ArchiveTeam/terroroftinytown/pull/67
08:01 🔗 Flashfire Hey guys um Zapt.In is listed as dead on the wiki but its short urls are resolving
08:01 🔗 Fusl chfoo: first time adding a service custom script to urlteam, can you take a look at this and lmk if i need anything else changed?
08:03 🔗 Flashfire Wait hold on never mind they are set to redirect through the wayback machine
08:29 🔗 Zerote has joined #urlteam
09:02 🔗 Kagee The documentation for using the docker image with environment variables at https://hub.docker.com/r/archiveteam/warrior-dockerfile/ appears to have an "-e" that makes the command fail
09:04 🔗 Kagee looks like it was fixed 17 days ago on github :/
09:07 🔗 Flashfire FUSL
09:08 🔗 Fusl Flashfire: ye?
09:08 🔗 Flashfire above
09:08 🔗 Flashfire you made the docker file thing didnt you?
09:08 🔗 Fusl nope
09:09 🔗 Fusl i wish i had access to the archiveteam org so that i could set up autobuilds
09:10 🔗 Kagee i can run multiple warriors on different ip's, right?
09:10 🔗 Kagee i've got too many underused VPSes hanging around
09:10 🔗 Fusl yea
09:20 🔗 mtntmnky has quit IRC (Remote host closed the connection)
09:20 🔗 mtntmnky has joined #urlteam
09:46 🔗 Zerote Flashfire: Still looking for some shortener suggestions? I recently added a small handful of YOURLS based ones to the wiki
10:04 🔗 Kagee Any way to check stats for my own nick on the tracker?
10:06 🔗 Zerote Kagee: You can select the row limit on the tracker, set it to 5000 or so, and CTRL+F your nick. It's pretty heavy to render and update that many rows though, so your browser might be a bit sluggish
10:10 🔗 Zerote I wrote a small python program that gets the data from the websocket and displays some more interesting info like scans/hr, but the code is a bit too sloppy to post on github right now
11:13 🔗 JAA So regarding goo.gl, I was hoping we could get URLTeam to write WARCs before starting that one. Although I guess we could still do that later and regrab them if necessary. The WARC thing probably won't happen anytime soon anyway.
11:20 🔗 Fusl yeah warcing would be so great for urlteam
11:20 🔗 Fusl https://github.com/ArchiveTeam/terroroftinytown/issues/1
11:20 🔗 Fusl issue #1
11:23 🔗 Fusl the thing is. how do we synchronize the warcs over to a master server? are we gonna do it in a similar fashion as the other tracker projects work by rsyncing them or do we push them to the tracker server?
11:27 🔗 JAA I don't think we'll want to put even more work on the tracker, so something like the other projects is probably best.
12:49 🔗 Kagee Zerote: made a very ugly hack while working on mobile: wscat --connect "wss://tracker.archiveteam.org:1338/api/live_stats" 2>/dev/null | grep -m 1 Kagee | sed 's/{/\n{/' | tail -1 | jq '.lifetime["Kagee"] | "Found: \(.[0]), scanned: \(.[1])"' which prints: "Found: 38194, scanned: 215085"
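[Editor's note] The jq filter in Kagee's one-liner can be expressed in Python as below. The message shape is an assumption inferred from that filter alone: a JSON object with a `lifetime` mapping from nick to a list whose first two entries are the found and scanned counts. The function name is hypothetical, and the sketch only covers parsing a message that has already been received, not the websocket connection.

```python
import json
from typing import Optional


def lifetime_summary(message: str, nick: str) -> Optional[str]:
    """Format the lifetime stats for one nick from a live_stats JSON message.

    Returns None when the nick is not present in the message.
    """
    stats = json.loads(message)
    entry = stats.get("lifetime", {}).get(nick)
    if entry is None:
        return None
    found, scanned = entry[0], entry[1]
    return f"Found: {found}, scanned: {scanned}"


if __name__ == "__main__":
    # Sample message built from the numbers quoted in the log.
    sample = '{"lifetime": {"Kagee": [38194, 215085]}}'
    print(lifetime_summary(sample, "Kagee"))
```

Parsing the JSON directly avoids the grep and sed steps, which were only needed to isolate one JSON object from the raw stream.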
14:05 🔗 Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
14:49 🔗 seatsea has joined #urlteam
18:10 🔗 ave_ has joined #urlteam
20:30 🔗 ave_ has quit IRC (Quit: Connection closed for inactivity)
21:46 🔗 Soni has joined #urlteam
