| Time |
Nickname |
Message |
|
02:36
🔗
|
|
VariXx has quit IRC (Quit: moo) |
|
03:05
🔗
|
Somebody2 |
Flashfire: I don't understand your question? It doesn't matter how many shortcodes in a set return results; the AutoQueue treats it the same. |
|
03:07
🔗
|
Flashfire |
I mean for the non-incremental batches. For example, if batch 48 returns 35/50 successful but batch 49 returns 50/50 successful, will batch 49 be queued again when batch 48 is tried again at a later date? |
|
03:08
🔗
|
Flashfire |
Somebody2: does that explain it any better? |
|
03:25
🔗
|
Somebody2 |
The non-incremental ones are just using a hash to convert the seq#s to shortcodes -- it's otherwise the same, |
|
03:26
🔗
|
Somebody2 |
so it doesn't do any tracking of what gets returned |
|
03:26
🔗
|
Somebody2 |
just that it got back info about the batch |
|
03:26
🔗
|
Somebody2 |
does that answer your question? |
|
03:27
🔗
|
Somebody2 |
I think there may still be some mutual misunderstanding going on |
|
03:32
🔗
|
Flashfire |
but Somebody2, if a batch returns all successful codes, is that batch retired as such? |
|
03:50
🔗
|
|
kiska1 has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
wmvhater has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Mayonaise has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
TigerbotH has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
treora has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
BnAboyZ has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Kaz has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Jusque has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
teej_ has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
chr1sm has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
HCross has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
voltagex_ has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
aard has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
bakJAA has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Ctrl-S_ has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
deathy has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
eLbot has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
kpcyrd has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
buckket has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Hecatz has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
zhongfu has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Muad-Dib has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
moufu has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Frogging has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
jornane has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
hook54321 has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
mr_archiv has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
joepie91 has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
z00nx has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
w0rmhole has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
kiska has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Flashfire has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
dashcloud has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
Boppen has quit IRC (hub.efnet.us ny.us.hub) |
|
03:50
🔗
|
|
dashcloud has joined #urlteam |
|
03:50
🔗
|
|
kiska1 has joined #urlteam |
|
03:50
🔗
|
|
wmvhater has joined #urlteam |
|
03:50
🔗
|
|
moufu has joined #urlteam |
|
03:50
🔗
|
|
Jusque has joined #urlteam |
|
03:50
🔗
|
|
Frogging has joined #urlteam |
|
03:50
🔗
|
|
BnAboyZ has joined #urlteam |
|
03:50
🔗
|
|
treora has joined #urlteam |
|
03:50
🔗
|
|
teej_ has joined #urlteam |
|
03:50
🔗
|
|
chr1sm has joined #urlteam |
|
03:50
🔗
|
|
HCross has joined #urlteam |
|
03:50
🔗
|
|
voltagex_ has joined #urlteam |
|
03:50
🔗
|
|
Mayonaise has joined #urlteam |
|
03:50
🔗
|
|
buckket has joined #urlteam |
|
03:50
🔗
|
|
Boppen has joined #urlteam |
|
03:50
🔗
|
|
jornane has joined #urlteam |
|
03:50
🔗
|
|
hook54321 has joined #urlteam |
|
03:50
🔗
|
|
TigerbotH has joined #urlteam |
|
03:50
🔗
|
|
aard has joined #urlteam |
|
03:50
🔗
|
|
bakJAA has joined #urlteam |
|
03:50
🔗
|
|
Ctrl-S_ has joined #urlteam |
|
03:50
🔗
|
|
deathy has joined #urlteam |
|
03:50
🔗
|
|
Kaz has joined #urlteam |
|
03:50
🔗
|
|
eLbot has joined #urlteam |
|
03:50
🔗
|
|
kpcyrd has joined #urlteam |
|
03:50
🔗
|
|
Hecatz has joined #urlteam |
|
03:50
🔗
|
|
zhongfu has joined #urlteam |
|
03:50
🔗
|
|
Muad-Dib has joined #urlteam |
|
03:50
🔗
|
|
mr_archiv has joined #urlteam |
|
03:50
🔗
|
|
joepie91 has joined #urlteam |
|
03:50
🔗
|
|
z00nx has joined #urlteam |
|
03:50
🔗
|
|
w0rmhole has joined #urlteam |
|
03:50
🔗
|
|
kiska has joined #urlteam |
|
03:50
🔗
|
|
Flashfire has joined #urlteam |
|
03:50
🔗
|
|
ny.us.hub sets mode: +oo HCross bakJAA |
|
03:51
🔗
|
|
JAA sets mode: +o bakJAA |
|
03:52
🔗
|
|
bakJAA sets mode: +o JAA |
|
04:05
🔗
|
|
teej_ is now known as tees |
|
04:05
🔗
|
|
tees is now known as teej_ |
|
04:12
🔗
|
|
marked has left |
|
04:13
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
|
04:14
🔗
|
|
dashcloud has quit IRC (Ping timeout: 492 seconds) |
|
04:15
🔗
|
Somebody2 |
There's no "retired". The AutoQueue feature just keeps creating new batches with incremented sequence numbers. Everything else is manual. |
|
04:16
🔗
|
Somebody2 |
Each batch is sent out to a single warrior, then the tracker waits for a specified amount of time (generally half an hour) for the warrior to report back. |
|
04:16
🔗
|
|
teej_ is now known as te3j |
|
04:17
🔗
|
Somebody2 |
If it does, then the batch is done, no matter what the contents of the results are. |
|
04:17
🔗
|
Somebody2 |
If the warrior times out, or returns an error, the batch is sent out to a different warrior. |
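The lifecycle Somebody2 describes above can be sketched roughly as follows. This is only an illustration of the stated rules, not the terroroftinytown tracker code: the names `Batch`, `assign`, `report`, and `needs_reassignment` are hypothetical, and the 30-minute timeout is taken from "generally half an hour".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: "generally half an hour", expressed in seconds.
TIMEOUT_SECONDS = 30 * 60


@dataclass
class Batch:
    """Hypothetical record for one batch of shortcodes."""
    seq: int
    warrior: Optional[str] = None
    assigned_at: Optional[float] = None
    done: bool = False


def assign(batch: Batch, warrior: str, now: float) -> None:
    """Hand the batch to a single warrior and start the clock."""
    batch.warrior = warrior
    batch.assigned_at = now


def report(batch: Batch, ok: bool, results=None) -> None:
    """Record a warrior's report.

    A successful report finishes the batch no matter what `results`
    contains; an error frees the batch for another warrior.
    """
    if ok:
        batch.done = True
    else:
        batch.warrior = None
        batch.assigned_at = None


def needs_reassignment(batch: Batch, now: float) -> bool:
    """True if the batch should be sent to a different warrior."""
    if batch.done:
        return False
    if batch.warrior is None:
        return True
    return now - batch.assigned_at > TIMEOUT_SECONDS
```

Under this model, a batch that came back with 35/50 hits and one that came back with 50/50 hits are both simply done; nothing is re-queued unless someone does so manually.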
|
04:20
🔗
|
|
te3j is now known as teej_____ |
|
04:21
🔗
|
|
teej_____ is now known as t3 |
|
04:29
🔗
|
|
odemg has joined #urlteam |
|
05:45
🔗
|
|
logchfoo0 starts logging #urlteam at Sat Dec 15 05:45:12 2018 |
|
05:45
🔗
|
|
logchfoo0 has joined #urlteam |
|
08:21
🔗
|
|
JAA has quit IRC (Ping timeout: 246 seconds) |
|
08:21
🔗
|
|
svchfoo1 has quit IRC (Ping timeout: 246 seconds) |
|
08:21
🔗
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
|
09:20
🔗
|
|
svchfoo1 has joined #urlteam |
|
09:20
🔗
|
|
JAA has joined #urlteam |
|
09:20
🔗
|
|
bakJAA sets mode: +o JAA |
|
09:21
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
|
09:27
🔗
|
|
odemg has joined #urlteam |
|
09:27
🔗
|
|
chfoo has quit IRC (Read error: Operation timed out) |
|
09:30
🔗
|
|
chfoo has joined #urlteam |
|
10:41
🔗
|
|
VariXx has joined #urlteam |
|
13:38
🔗
|
|
psi has joined #urlteam |
|
14:13
🔗
|
|
dashcloud has joined #urlteam |
|
14:30
🔗
|
|
trvz has joined #urlteam |
|
14:31
🔗
|
trvz |
Fusl: who says burstable |
|
14:34
🔗
|
|
dashcloud has quit IRC (Ping timeout: 265 seconds) |
|
14:35
🔗
|
Fusl |
you used google cloud, right? |
|
14:35
🔗
|
|
ave_ has joined #urlteam |
|
14:35
🔗
|
|
yano has joined #urlteam |
|
14:35
🔗
|
yano |
hai |
|
14:35
🔗
|
Fusl |
hi |
|
14:35
🔗
|
|
jut has joined #urlteam |
|
14:36
🔗
|
|
erine has joined #urlteam |
|
14:36
🔗
|
Fusl |
let me quickly figure out if this all works properly with the docker warrior |
|
14:36
🔗
|
yano |
ack |
|
14:36
🔗
|
Fusl |
if it does, you can run it with: for i in $(seq NUM_OF_THREADS); do docker container run -d --rm -e DOWNLOADER=Fusl -e SELECTED_PROJECT=urlteam2 -e CONCURRENT_ITEMS=1 archiveteam/warrior-dockerfile; done |
|
14:37
🔗
|
Fusl |
seems to work fine |
|
14:37
🔗
|
Fusl |
happy scraping |
|
14:39
🔗
|
|
diggan has joined #urlteam |
|
14:42
🔗
|
yano |
to install docker on a Debian 9 system: |
|
14:42
🔗
|
yano |
apt -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common; curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -; add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"; apt update; apt -y install docker-ce |
|
14:44
🔗
|
Fusl |
actually |
|
14:44
🔗
|
Fusl |
curl https://get.docker.com | bash |
|
14:45
🔗
|
yano |
oh |
|
14:45
🔗
|
yano |
hah |
|
14:45
🔗
|
yano |
i took my line from the docker docs and just combined their steps into one long one-liner |
|
14:49
🔗
|
ave_ |
Anyone here using hetzner? Did you get abuse warnings from them? I'll spin up 4-5 more warriors on my hetzner vpses if they don't issue abuse warnings for continuously scraping url shortening sites. |
|
14:50
🔗
|
yano |
i'm using google cloud platform |
|
14:58
🔗
|
trvz |
what are the limitations here? good concurrency numbers? |
|
15:01
🔗
|
psi |
you don't want too much concurrency per ip |
|
15:01
🔗
|
Fusl |
yeah, they tend to blacklist very quickly |
|
15:01
🔗
|
Fusl |
so as long as you have an unlimited set of ips, everything is well |
|
15:02
🔗
|
|
coldon2dr has joined #urlteam |
|
15:03
🔗
|
trvz |
what's too much? |
|
15:16
🔗
|
Fusl |
that depends on the service |
|
15:16
🔗
|
Fusl |
bit.ly blocks very fast |
|
15:16
🔗
|
Fusl |
i had one thread running on a single ip, got banned within a few hours |
|
15:21
🔗
|
JAA |
Yeah, we should look into increasing the delay on bit.ly probably. |
|
15:23
🔗
|
JAA |
In general, the tracker only hands out one item per shortener and IP address. So running a concurrency higher than 5 on one IP is currently useless (because we have 5 shorteners active right now). |
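A toy model of the rule JAA states, assuming a hypothetical `ToyTracker` class and placeholder shortener names; the real tracker's data structures are not shown in this log. It only illustrates why extra concurrency on one IP goes unused once every active shortener already has an item out to that IP.

```python
class ToyTracker:
    """Hands out at most one in-flight item per (IP, shortener) pair."""

    def __init__(self, shorteners):
        self.shorteners = list(shorteners)
        self.in_flight = set()  # (ip, shortener) pairs currently handed out

    def request_item(self, ip):
        """Return a shortener name to work on, or None if the IP is saturated."""
        for shortener in self.shorteners:
            if (ip, shortener) not in self.in_flight:
                self.in_flight.add((ip, shortener))
                return shortener
        return None

    def finish_item(self, ip, shortener):
        """Mark an item as reported so the pair can receive new work."""
        self.in_flight.discard((ip, shortener))


# Five active shorteners (placeholder names), eight concurrent requests from one IP.
tracker = ToyTracker(f"shortener-{n}" for n in range(1, 6))
handed_out = [tracker.request_item("203.0.113.7") for _ in range(8)]
useful = [item for item in handed_out if item is not None]
```

In this sketch only five of the eight requests receive work; a second IP would get its own five.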
|
16:20
🔗
|
Fusl |
why is it not more than 5 active tho? |
|
16:28
🔗
|
|
odemg has quit IRC (Read error: Connection reset by peer) |
|
16:53
🔗
|
JAA |
Because no one configured more shorteners. |
|
16:59
🔗
|
|
LFlare43 has joined #urlteam |
|
16:59
🔗
|
|
LFlare43 is now known as LFlare |
|
18:15
🔗
|
|
odemg has joined #urlteam |
|
18:16
🔗
|
|
odemg_ has joined #urlteam |
|
18:16
🔗
|
|
odemg_ has quit IRC (Remote host closed the connection) |
|
18:42
🔗
|
|
n00b_42 has joined #urlteam |
|
18:46
🔗
|
Somebody2 |
If someone wants to dig into a fix for this error, that would be VERY APPRECIATED: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 7: invalid continuation byte" |
|
18:46
🔗
|
Somebody2 |
Here's the rest of the traceback: https://hastebin.com/wiqafihadi.sql |
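For anyone picking this up, the error class itself is easy to reproduce in isolation. The sketch below uses a made-up byte string (`sample`) and a hypothetical helper `decode_lenient`; where the failing decode happens in the client code is not shown here, so this only illustrates the failure mode and one possible fallback, not a confirmed fix.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1, which accepts any byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Made-up input: 0xdd starts a two-byte UTF-8 sequence, but the next byte
# ("a") is not a continuation byte, so strict decoding fails at index 7.
sample = b"http://\xddabc"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep the exception for inspection

recovered = decode_lenient(sample)
```

Latin-1 is one fallback among several; `raw.decode("utf-8", errors="replace")` would instead substitute U+FFFD for the bad byte.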
|
18:47
🔗
|
|
n00b_42 has quit IRC (Quit: http://chat.efnet.org (Ping timeout)) |
|
18:47
🔗
|
Somebody2 |
I'll see about spinning up a bunch more projects to make use of all the warriors we've got now. |
|
18:49
🔗
|
Somebody2 |
turning on xco to see what happens |
|
18:51
🔗
|
Somebody2 |
and waa-ai |
|
18:52
🔗
|
Somebody2 |
and pub-vitrue-com |
|
18:52
🔗
|
Somebody2 |
probably most of these won't actually get us useful data, but WTH... |
|
18:54
🔗
|
Somebody2 |
and u4-hu |
|
18:55
🔗
|
Somebody2 |
and tighturl-com |
|
18:56
🔗
|
Somebody2 |
turned isgd_6 back on, I think the warriors have been updated to handle HTTPS |
|
19:03
🔗
|
Somebody2 |
turned on shar-es too |
|
19:03
🔗
|
Somebody2 |
there are a BUNCH of nice features I'd like to add to the admin interface of urlteam... |
|
19:04
🔗
|
Somebody2 |
like a *notes* field built into the admin interface |
|
19:04
🔗
|
Somebody2 |
rather than having to check back on the wiki |
|
19:04
🔗
|
Somebody2 |
and (if I'm dreaming) various summary info about which shortcode ranges we've already tried, and when |
|
19:04
🔗
|
Somebody2 |
if someone who codes in Python wanted to help... |
|
19:06
🔗
|
Somebody2 |
and turned on trap-it |
|
19:07
🔗
|
Somebody2 |
So far, waa-ai, u4-hu, and tighturl-com have returned no results. |
|
19:08
🔗
|
Somebody2 |
and u4-hu is throwing some errors, so I'm turning it off |
|
19:10
🔗
|
Somebody2 |
so is waa-ai, so it's going off, too |
|
19:12
🔗
|
Somebody2 |
trap-it is returning 301s that we didn't expect; now fixing that, and it should start generating results |
|
19:13
🔗
|
Somebody2 |
added vbly-us |
|
19:23
🔗
|
Somebody2 |
turning on mysp-ac |
|
19:25
🔗
|
Somebody2 |
trap-it seems to just redirect all requests to scribblelive.com |
|
19:25
🔗
|
|
mtntmnky has joined #urlteam |
|
19:25
🔗
|
Somebody2 |
turning it off |
|
19:27
🔗
|
Somebody2 |
OK, we're up to around 705 scans/sec -- I'm going to step away for a while. If/when everything blows up, feel free to ping me. |
|
19:27
🔗
|
Somebody2 |
mysp-ac is returning results |
|
19:27
🔗
|
psi |
Seeing some of these `Error communicating with tracker: HTTPConnectionPool(host='tracker.archiveteam.org', port=1337): Read timed out. (read timeout=60).` when uploading |
|
19:28
🔗
|
Somebody2 |
yep, that happens |
|
19:28
🔗
|
Somebody2 |
we're pushing the tracker hard |
|
19:28
🔗
|
psi |
ah :P |
|
19:42
🔗
|
kpcyrd |
Somebody2: I didn't read the whole backlog, but are you on linux, are locales configured and $LANG set correctly? |
|
19:42
🔗
|
kpcyrd |
that's not the case in docker for example and python doesn't really like that for some reason |
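One way to check kpcyrd's hypothesis on a given warrior is to print what Python thinks the environment's encodings are. This is a small diagnostic sketch; `encoding_report` is a hypothetical helper, and whether the locale is actually the cause of the error above is not established in this log.

```python
import locale
import os
import sys


def encoding_report(environ=None):
    """Collect the locale-related settings Python sees in this process."""
    env = os.environ if environ is None else environ
    return {
        "LANG": env.get("LANG"),
        "LC_ALL": env.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```

Running it inside and outside the container and comparing the output would show whether `LANG` is unset in the container.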
|
19:44
🔗
|
Somebody2 |
kpcyrd: that is certainly possible! |
|
19:44
🔗
|
Somebody2 |
This is not on my local box, but on the URLTeam client code running on the warriors. |
|
19:44
🔗
|
Somebody2 |
The repo that you'd be looking at is: https://github.com/ArchiveTeam/terroroftinytown |
|
19:45
🔗
|
Somebody2 |
let me know if/how else I can help you get started investigating it more (should you wish to) |
|
19:46
🔗
|
kpcyrd |
I'm having trouble loading the traceback for some reason |
|
19:50
🔗
|
kpcyrd |
Somebody2: does your hastebin link work for you? It seems there's an issue with their server right now |
|
19:52
🔗
|
|
pnJay has joined #urlteam |
|
19:53
🔗
|
JAA |
Doesn't work for me either. |
|
19:56
🔗
|
pnJay |
how many concurrent per ip for this project? |
|
19:58
🔗
|
Somebody2 |
hm, let me put it somewhere else |
|
19:58
🔗
|
Somebody2 |
pnJay: doesn't matter |
|
19:58
🔗
|
Somebody2 |
as many or as few as you wish |
|
19:58
🔗
|
Somebody2 |
more than 12 is useless |
|
19:58
🔗
|
pnJay |
12 was the answer I was looking for, I guess. Thanks! |
|
19:59
🔗
|
Somebody2 |
I can raise it higher by turning on more projects |
|
20:00
🔗
|
Somebody2 |
does this work for the traceback? https://0bin.net/paste/Jaxv8+B3BSnKad0w#-+R683qB8HzivzsboLxiIAc1WyxTw+6nsH7ZaIsYr3j |
|
20:05
🔗
|
pnJay |
Will I break anything or piss anyone off if I point a buncha warriors at this? |
|
20:08
🔗
|
kpcyrd |
Somebody2: works now, does this happen for every url or just a few? |
|
20:11
🔗
|
kpcyrd |
Somebody2: and do you have the url that was requested in the logs? |
|
20:13
🔗
|
Somebody2 |
intermittently |
|
20:14
🔗
|
Somebody2 |
here's the range of URLs -- it happened for one of them, I can't tell exactly which |
|
20:17
🔗
|
Somebody2 |
one thing I could do is make the ranges smaller, temporarily, to make it easier to figure out which ones are borking |
|
20:18
🔗
|
Somebody2 |
oh, and I know it's intermittent, because if we retry the same set of URLs enough times, I don't get the error |
|
20:18
🔗
|
Somebody2 |
OK, here's a range that it happened in: 38gr1cn TO 38gr1db |
|
20:19
🔗
|
Somebody2 |
that's 50 shortcodes |
|
20:19
🔗
|
Somebody2 |
cn, co, cp, cq ... d8, d9, da, db |
|
20:20
🔗
|
Somebody2 |
and this is for tinyurl.com |
|
20:20
🔗
|
Somebody2 |
whoops, sorry, 25, not 50 |
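The count of 25 can be checked by enumerating the range. The sketch below assumes the shortcode alphabet is `0-9` followed by `a-z`, which is consistent with the sequence listed above (`... cz, d0 ... d9, da, db`) but is not confirmed for tinyurl.com in this log; the function names are hypothetical.

```python
# Assumption: digits first, then lowercase letters (base 36).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_int(code: str) -> int:
    """Interpret a shortcode as a number in the assumed alphabet."""
    value = 0
    for ch in code:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value


def to_code(value: int, width: int) -> str:
    """Convert a number back to a shortcode, left-padded to `width`."""
    chars = []
    while value:
        value, remainder = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])


def shortcode_range(first: str, last: str) -> list:
    """List every shortcode from `first` to `last`, inclusive."""
    width = len(first)
    return [to_code(n, width) for n in range(to_int(first), to_int(last) + 1)]


codes = shortcode_range("38gr1cn", "38gr1db")
print(len(codes))  # 25
```

Splitting `codes` in half and retrying each half separately is one way to narrow down which shortcode triggers the error, in the spirit of making the ranges smaller.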
|
20:21
🔗
|
Somebody2 |
pnJay: nope, you are welcome to point a bunch of Warriors at this -- they just may not get very well used! |
|
20:28
🔗
|
|
caff_ has joined #urlteam |
|
20:32
🔗
|
|
caff has quit IRC (Read error: Operation timed out) |
|
20:42
🔗
|
psi |
i have 12 concs going at this now |
|
20:42
🔗
|
psi |
on 2 hosts |
|
20:49
🔗
|
|
hook54321 has quit IRC (Quit: Connection closed for inactivity) |
|
22:18
🔗
|
erine |
ave_: Wait, really? |
|
22:19
🔗
|
erine |
How long does it take to get the notices? |
|
22:57
🔗
|
Somebody2 |
erine: what? |
|
22:57
🔗
|
erine |
Hetzner warning notices? |
|
22:58
🔗
|
Somebody2 |
Ah, I don't know anything about those. |
|
23:28
🔗
|
|
hook54321 has joined #urlteam |