Time |
Nickname |
Message |
01:19
🔗
|
|
Zerote has quit IRC (Ping timeout: 260 seconds) |
01:22
🔗
|
|
tech234a has joined #urlteam |
03:18
🔗
|
|
richid has joined #urlteam |
03:21
🔗
|
richid |
Hey folks, can I make a request to stop (or slow) a crawl? |
03:21
🔗
|
Flashfire |
You certainly can |
03:21
🔗
|
Flashfire |
What crawl would that be richid? |
03:22
🔗
|
Flashfire |
Who are you representing? |
03:22
🔗
|
richid |
In particular, https://like2b.uy |
03:22
🔗
|
Flashfire |
Certainly I can |
03:22
🔗
|
richid |
It looks like it, but it's not a URL shortener |
03:22
🔗
|
Flashfire |
I honestly forgot I turned that back on. It was supposed to stay off |
03:22
🔗
|
richid |
And if you take a look at the logs, it's all just 302s to the same URL |
03:23
🔗
|
Flashfire |
I will turn off auto queue and let the current jobs finish |
03:23
🔗
|
richid |
Looks like it already stopped :) |
03:23
🔗
|
Flashfire |
I think that one came from an Instagram search. There are a few that weren't redirects to the same URL, but most of them are |
03:24
🔗
|
Flashfire |
Yeah the items were coming really fast |
03:24
🔗
|
Flashfire |
Do you have any recommendations for URL shorteners to start crawling, richid? |
03:25
🔗
|
Flashfire |
I did look into it again. It looks like they are redirects to storefronts through Instagram, but most of them are longer than the short codes we are scanning |
03:25
🔗
|
Flashfire |
sorry about that |
03:25
🔗
|
Flashfire |
But if you have any suggestions to add or turn back onto the tracker just let me know |
03:26
🔗
|
richid |
No worries, thanks for being so responsive |
03:26
🔗
|
Flashfire |
All good, I was watching Hulu when I saw the message pop up on my client |
03:27
🔗
|
richid |
After looking at your tracker, you've got way more shorteners in there than I even know about |
03:27
🔗
|
Flashfire |
So as I said if there are any shorteners you think should be added to the tracker or turned back on just give me a ping and I will see what I can do |
03:27
🔗
|
Flashfire |
Ahahaha yeah we have a fair few. I mainly stick to the simple ones; I haven't worked out coding just yet. But if you have any suggestions just send them my way |
03:30
🔗
|
richid |
Will do! and thanks for doing this work, I'm a big fan of archiving projects |
03:30
🔗
|
Flashfire |
my absolute pleasure. I am the same. I came to Archive Team only a year or 2 ago and fell in love with the URLTeam project |
03:31
🔗
|
Flashfire |
I always came across heaps in my journeys to the weird scummy parts of the net and was glad someone came to preserve them |
03:32
🔗
|
Somebody2 |
Thanks for being understanding, richid! |
03:33
🔗
|
Somebody2 |
Flashfire: re a.gd -- always better to add more info |
03:33
🔗
|
Flashfire |
Somebody2, do I clean up and delete the existing info or just tack the new info onto the end? |
03:36
🔗
|
Flashfire |
also for something like "1click.im - DNS not responding as of 15:48, 15 May 2016 (EDT)", is it ok if I update it to "not responding as of 2019", or is that a stupid change to make, Somebody2? |
03:37
🔗
|
|
richid has quit IRC (Quit: Page closed) |
03:37
🔗
|
|
odemg has quit IRC (Ping timeout: 615 seconds) |
03:43
🔗
|
Flashfire |
I am going to go ahead and do it; you can revert my changes later if you prefer |
03:44
🔗
|
|
odemg has joined #urlteam |
04:12
🔗
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
04:25
🔗
|
Flashfire |
Does anyone know if we are starting the goo.gl scrape soon? |
04:26
🔗
|
Flashfire |
I might start it up on the very smallest queue when this current export finishes |
04:28
🔗
|
Flashfire |
Does anyone know if goo.gl uses _ |
04:31
🔗
|
|
warmwaffl has quit IRC (Remote host closed the connection) |
04:41
🔗
|
t3 |
Flashfire: I am trying to find that out for you. |
04:47
🔗
|
t3 |
I tried to find URLs with `_`, but I did not find any with a quick search. I also tried to experiment using URLs with the underscore, but they did not redirect and only gave 404s. I do not think underscores were used. |
04:48
🔗
|
t3 |
I hope that helps. |
05:04
🔗
|
Fusl |
2018-12-25 06:53:37 Fusl FYI if this wasn't known yet, goo.gl urls currently are 4-6 chars A-Za-z0-9, 301 is redirect, 404 is not found, 403 is banned |
05:06
🔗
|
Flashfire |
no 3 characters at all Fusl? |
05:09
🔗
|
Flashfire |
Hmmm does goo.gl only work for https? |
05:11
🔗
|
|
tech234a has joined #urlteam |
05:14
🔗
|
Fusl |
two-three chars are reserved by google |
05:15
🔗
|
Flashfire |
But that means that they still exist maybe |
05:15
🔗
|
Fusl |
2018-12-25 06:55:30 Fusl 1-2 char urls are reserved for google internal use, 3 char urls don't seem to be used at all |
05:15
🔗
|
Fusl |
2018-12-25 07:02:27 Fusl 200 is for deleted URLs |
05:15
🔗
|
Fusl |
2018-12-25 07:11:38 Fusl here are my bets for how many URLs there are 301ing: 4: 11.60M, 5: 765.89M, 6: 7838.43M |
05:15
🔗
|
Fusl |
2018-12-25 12:28:49 @JAA Fusl: Thanks. Do you know whether the codes are incremental or random? |
05:15
🔗
|
Fusl |
2018-12-25 12:31:03 Fusl they're random |
05:15
🔗
|
Fusl |
2018-12-25 12:31:22 Fusl at least from my tests |
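For scale, the estimates quoted above can be set against the size of each keyspace. This is only a rough sketch: it assumes the 62-symbol alphabet (A-Za-z0-9) Fusl reported and takes the per-length estimates at face value. The names `ALPHABET`, `ESTIMATED_LIVE_MILLIONS` and `keyspace` are made up for this illustration and do not come from the tracker.

```python
import string

# Alphabet reported in the channel: A-Z, a-z, 0-9 (62 symbols).
ALPHABET = string.ascii_letters + string.digits

# Fusl's estimates of codes that answer with a 301, in millions, keyed by code length.
ESTIMATED_LIVE_MILLIONS = {4: 11.60, 5: 765.89, 6: 7838.43}


def keyspace(length: int) -> int:
    """Number of possible codes of exactly `length` characters."""
    return len(ALPHABET) ** length


for length, live_millions in ESTIMATED_LIVE_MILLIONS.items():
    total = keyspace(length)
    share = live_millions * 1_000_000 / total
    print(f"{length} chars: {total:,} possible codes, ~{share:.1%} estimated live")
```

Since the codes appear to be random rather than incremental, the share of live codes per length gives a rough idea of how many requests a scan would spend on 404s.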
05:15
🔗
|
Flashfire |
I mean if they are used then maybe I could start them from 0 and have it super slow |
05:20
🔗
|
Flashfire |
Fusl do you think that 3 batches of 10 urls at a time will trigger rate limiting? |
05:21
🔗
|
Flashfire |
I have the project almost ready |
05:21
🔗
|
Fusl |
Flashfire: i can do a thousand requests within a short time without even getting a single 403 |
05:22
🔗
|
Flashfire |
Fusl arent you rotating IPs though? |
05:22
🔗
|
Flashfire |
Also I want to play it safe damn it |
05:22
🔗
|
Fusl |
tested on my laptop routed over a single ISP |
05:22
🔗
|
Flashfire |
I will set it up the same way I did Puri.na until I am confident |
05:23
🔗
|
Fusl |
also |
05:23
🔗
|
Fusl |
302 https://goo.gl/v6BJWm https://www.google.com/sorry/index?continue=https://goo.gl/v6BJWm&q=EgRR2W9pGLCT4OUFIhkA8aeDS3sL7I_rQ10HPgBD6zBENIpCeTnsMgFy |
05:23
🔗
|
Fusl |
banned ^ |
05:23
🔗
|
Fusl |
http://xor.meo.ws/Y7outygiuCXUlhYFXHGQDkCQh84uL6Yb.txt |
05:24
🔗
|
Flashfire |
when is that from? |
05:24
🔗
|
Flashfire |
I will add 302 to the banned list |
05:24
🔗
|
Fusl |
just now |
05:24
🔗
|
Fusl |
no |
05:24
🔗
|
Fusl |
302 is a redirect |
05:24
🔗
|
Fusl |
but |
05:24
🔗
|
Flashfire |
Setting it up same as Puri.na 10 urls in batches of 5? |
05:24
🔗
|
Fusl |
302 https://goo.gl/buJZZI http://www.pinkrod.com/videos/13912/horny-michelle-lay-having-fun-with-a-young-guy/ |
05:24
🔗
|
Fusl |
this one is good |
05:25
🔗
|
Fusl |
but a 302 to https://www.google.com/sorry/index* is not |
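The status codes discussed so far could be sorted roughly like this. It is a minimal sketch based only on what was reported in the channel (301/302 redirect, 404 not found, 403 banned, 200 deleted, and a 302 to https://www.google.com/sorry/index* counting as banned). The function name `classify` and the label strings are hypothetical and are not how the tracker or Fusl's script does it.

```python
from typing import Optional
from urllib.parse import urlsplit


def classify(status: int, location: Optional[str] = None) -> str:
    """Map a goo.gl response to a label, following the codes reported in the channel."""
    if status in (301, 302) and location:
        target = urlsplit(location)
        # A redirect to Google's "sorry" page is a rate-limit ban, not a real target.
        if target.hostname == "www.google.com" and target.path.startswith("/sorry/index"):
            return "banned"
        return "redirect"
    if status == 404:
        return "not_found"
    if status == 403:
        return "banned"
    if status == 200:
        return "deleted"
    # Anything else was not covered in the discussion.
    return "unknown"
```

Checking the parsed host and path, instead of searching the whole Location string, avoids flagging a legitimate target that merely contains "google.com/sorry" somewhere in its query string.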
05:25
🔗
|
Flashfire |
Ok so I will make the queue and url batches low and see how I go |
05:25
🔗
|
Flashfire |
5 batches of 10 running at once should be fine |
05:26
🔗
|
Flashfire |
If you dont have any objections fusl I will start it |
05:26
🔗
|
Flashfire |
I will keep an eye on it for as long as I can |
05:27
🔗
|
Fusl |
Flashfire: are we able to detect a 302 to https://www.google.com/sorry/index* as banned? |
05:27
🔗
|
Fusl |
thats my biggest concern |
05:27
🔗
|
Flashfire |
Location header reject regular expression: |
05:27
🔗
|
Flashfire |
that might be helpful |
05:27
🔗
|
Flashfire |
I dont know exactly how all of the tracker works |
05:28
🔗
|
Flashfire |
I assumed if we kept the queue low enough then we wouldnt run into it |
05:28
🔗
|
Flashfire |
Keep an eye on it as best we can |
05:28
🔗
|
Fusl |
thats not really a good way to ensure quality of the grabs |
05:29
🔗
|
Flashfire |
I know, but it's the best I have unless I just leave the settings as is and wait for someone more experienced |
05:30
🔗
|
Fusl |
Python regex to match non-redirect values in the Location header. Use to reject things like links to the shortener's homepage. |
05:31
🔗
|
Flashfire |
Well if you want to log into the tracker you do have access |
05:31
🔗
|
Fusl |
yeah im looking at that right now |
05:35
🔗
|
Fusl |
custom code? |
05:36
🔗
|
Fusl |
location header regex doesnt seem to do what we want |
05:37
🔗
|
Fusl |
clients that run into the 302 banned redirect will just keep consuming items from the queue |
05:37
🔗
|
Flashfire |
well shit |
05:37
🔗
|
Flashfire |
well I cannot write custom code, I can't code |
05:37
🔗
|
Fusl |
ill take a look |
05:54
🔗
|
Flashfire |
added digbig |
05:59
🔗
|
Fusl |
k i think i got the custom script to capture that rate limit redirect, just gonna do some tests |
05:59
🔗
|
Flashfire |
ok feel free |
05:59
🔗
|
Flashfire |
I am just cleaning up error reports, I will avoid the goo.gl ones |
06:14
🔗
|
Fusl_ |
https://github.com/Fusl/terroroftinytown/commit/cbffc8c80319eebaf2155c4d44ebb26f3f139bcb this should do the trick, still figuring out how to do the tests tho |
07:20
🔗
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
07:40
🔗
|
Flashfire |
https://1n.pm/ |
07:41
🔗
|
Flashfire |
https://loo.lt/ |
07:42
🔗
|
Flashfire |
http://vrg.sk/ |
07:51
🔗
|
Fusl |
https://github.com/ArchiveTeam/terroroftinytown/pull/67 |
08:01
🔗
|
Flashfire |
Hey guys um Zapt.In is listed as dead on the wiki but its short urls are resolving |
08:01
🔗
|
Fusl |
chfoo: first time adding a service custom script to urlteam, can you take a look at this and lmk if i need anything else changed? |
08:03
🔗
|
Flashfire |
Wait hold on never mind they are set to redirect through the wayback machine |
08:29
🔗
|
|
Zerote has joined #urlteam |
09:02
🔗
|
Kagee |
The documentation for using the docker image with environment variables at https://hub.docker.com/r/archiveteam/warrior-dockerfile/ appears to have an "-e" that makes the command fail |
09:04
🔗
|
Kagee |
looks like it was fixed 17 days ago on github :/ |
09:07
🔗
|
Flashfire |
FUSL |
09:08
🔗
|
Fusl |
Flashfire: ye? |
09:08
🔗
|
Flashfire |
above |
09:08
🔗
|
Flashfire |
you made the docker file thing didnt you? |
09:08
🔗
|
Fusl |
nope |
09:09
🔗
|
Fusl |
i wish i had access to the archiveteam org so that i could set up autobuilds |
09:10
🔗
|
Kagee |
i can run multiple warriors on different IPs, right? |
09:10
🔗
|
Kagee |
i've got too many underused VPSes hanging around |
09:10
🔗
|
Fusl |
yea |
09:20
🔗
|
|
mtntmnky has quit IRC (Remote host closed the connection) |
09:20
🔗
|
|
mtntmnky has joined #urlteam |
09:46
🔗
|
Zerote |
Flashfire: Still looking for some shortener suggestions? I recently added a small handful of YOURLS based ones to the wiki |
10:04
🔗
|
Kagee |
Any way to check stats for my own nick on the tracker? |
10:06
🔗
|
Zerote |
Kagee: You can select the row limit on the tracker, set it to 5000 or so, and CTRL+F your nick. It's pretty heavy to render and update that many rows though, so your browser might be a bit sluggish |
10:10
🔗
|
Zerote |
I wrote a small python program that gets the data from the websocket and displays some more interesting info like scans/hr, but the code is a bit too sloppy to post on github right now |
11:13
🔗
|
JAA |
So regarding goo.gl, I was hoping we could get URLTeam to write WARCs before starting that one. Although I guess we could still do that later and regrab them if necessary. The WARC thing probably won't happen anytime soon anyway. |
11:20
🔗
|
Fusl |
yeah warcing would be so great for urlteam |
11:20
🔗
|
Fusl |
https://github.com/ArchiveTeam/terroroftinytown/issues/1 |
11:20
🔗
|
Fusl |
issue #1 |
11:23
🔗
|
Fusl |
the thing is. how do we synchronize the warcs over to a master server? are we gonna do it in a similar fashion as the other tracker projects work by rsyncing them or do we push them to the tracker server? |
11:27
🔗
|
JAA |
I don't think we'll want to put even more work on the tracker, so something like the other projects is probably best. |
12:49
🔗
|
Kagee |
Zerote: made a very ugly hack while working on mobile: wscat --connect "wss://tracker.archiveteam.org:1338/api/live_stats" 2>/dev/null | grep -m 1 Kagee | sed 's/{/\n{/' | tail -1 | jq '.lifetime["Kagee"] | "Found: \(.[0]), scanned: \(.[1])"' Output: "Found: 38194, scanned: 215085" |
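The jq step of that pipeline could be written in Python along these lines. This is a sketch only: the message shape (a `lifetime` object mapping nick to a list with found at index 0 and scanned at index 1) is inferred from the jq filter above, not from any tracker documentation, and the helper name `lifetime_summary` and the sample message are made up. It works on a message string that has already been received, so no websocket connection is involved.

```python
import json
from typing import Optional


def lifetime_summary(message: str, nick: str) -> Optional[str]:
    """Return a "Found/scanned" line for `nick`, or None if the message lacks it."""
    payload = json.loads(message)
    entry = payload.get("lifetime", {}).get(nick)
    if entry is None:
        return None
    # Index 0 is treated as "found" and index 1 as "scanned", as in the jq filter.
    found, scanned = entry[0], entry[1]
    return f"Found: {found}, scanned: {scanned}"


# Hypothetical sample message, trimmed to the one field the jq filter reads.
sample = '{"lifetime": {"Kagee": [38194, 215085]}}'
print(lifetime_summary(sample, "Kagee"))  # prints: Found: 38194, scanned: 215085
```

Returning None for a missing nick makes it easy to skip messages that do not carry lifetime stats, which is what the `grep -m 1 Kagee` step does in the shell version.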
14:05
🔗
|
|
Hani has quit IRC (Quit: Going offline, see ya! (www.adiirc.com)) |
14:49
🔗
|
|
seatsea has joined #urlteam |
18:10
🔗
|
|
ave_ has joined #urlteam |
20:30
🔗
|
|
ave_ has quit IRC (Quit: Connection closed for inactivity) |
21:46
🔗
|
|
Soni has joined #urlteam |