Time |
Nickname |
Message |
02:36
🔗
|
|
VariXx has quit IRC (Quit: moo) |
03:05
🔗
|
Somebody2 |
Flashfire: I don't understand your question? It doesn't matter how many shortcodes in a set return results, the AutoQueue treats it the same. |
03:07
🔗
|
Flashfire |
I mean for the non-incremental batches. For example, if batch 48 returns 35/50 successful but batch 49 returns 50/50 successful, will batch 49 be queued again when batch 48 is tried again at a later date? |
03:08
🔗
|
Flashfire |
Somebody2 does that explain it any more |
03:25
🔗
|
Somebody2 |
The non-incremental ones are just using a hash to convert the seq#s to shortcodes -- it's otherwise the same, |
03:26
🔗
|
Somebody2 |
so it doesn't do any tracking of what gets returned |
03:26
🔗
|
Somebody2 |
just that it got back info about the batch |
03:26
🔗
|
Somebody2 |
does that answer your question? |
03:27
🔗
|
Somebody2 |
I think there may still be some mutual misunderstanding going on |
03:32
🔗
|
Flashfire |
but Somebody2, if a batch returns all successful codes, is that batch retired as such? |
03:50
🔗
|
|
kiska1 has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
wmvhater has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Mayonaise has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
TigerbotH has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
treora has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
BnAboyZ has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Kaz has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Jusque has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
teej_ has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
chr1sm has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
HCross has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
voltagex_ has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
aard has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
bakJAA has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Ctrl-S_ has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
deathy has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
eLbot has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
kpcyrd has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
buckket has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Hecatz has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
zhongfu has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Muad-Dib has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
moufu has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Frogging has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
jornane has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
hook54321 has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
mr_archiv has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
joepie91 has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
z00nx has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
w0rmhole has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
kiska has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Flashfire has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
dashcloud has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
Boppen has quit IRC (hub.efnet.us ny.us.hub) |
03:50
🔗
|
|
dashcloud has joined #urlteam |
03:50
🔗
|
|
kiska1 has joined #urlteam |
03:50
🔗
|
|
wmvhater has joined #urlteam |
03:50
🔗
|
|
moufu has joined #urlteam |
03:50
🔗
|
|
Jusque has joined #urlteam |
03:50
🔗
|
|
Frogging has joined #urlteam |
03:50
🔗
|
|
BnAboyZ has joined #urlteam |
03:50
🔗
|
|
treora has joined #urlteam |
03:50
🔗
|
|
teej_ has joined #urlteam |
03:50
🔗
|
|
chr1sm has joined #urlteam |
03:50
🔗
|
|
HCross has joined #urlteam |
03:50
🔗
|
|
voltagex_ has joined #urlteam |
03:50
🔗
|
|
Mayonaise has joined #urlteam |
03:50
🔗
|
|
buckket has joined #urlteam |
03:50
🔗
|
|
Boppen has joined #urlteam |
03:50
🔗
|
|
jornane has joined #urlteam |
03:50
🔗
|
|
hook54321 has joined #urlteam |
03:50
🔗
|
|
TigerbotH has joined #urlteam |
03:50
🔗
|
|
aard has joined #urlteam |
03:50
🔗
|
|
bakJAA has joined #urlteam |
03:50
🔗
|
|
Ctrl-S_ has joined #urlteam |
03:50
🔗
|
|
deathy has joined #urlteam |
03:50
🔗
|
|
Kaz has joined #urlteam |
03:50
🔗
|
|
eLbot has joined #urlteam |
03:50
🔗
|
|
kpcyrd has joined #urlteam |
03:50
🔗
|
|
Hecatz has joined #urlteam |
03:50
🔗
|
|
zhongfu has joined #urlteam |
03:50
🔗
|
|
Muad-Dib has joined #urlteam |
03:50
🔗
|
|
mr_archiv has joined #urlteam |
03:50
🔗
|
|
joepie91 has joined #urlteam |
03:50
🔗
|
|
z00nx has joined #urlteam |
03:50
🔗
|
|
w0rmhole has joined #urlteam |
03:50
🔗
|
|
kiska has joined #urlteam |
03:50
🔗
|
|
Flashfire has joined #urlteam |
03:50
🔗
|
|
ny.us.hub sets mode: +oo HCross bakJAA |
03:51
🔗
|
|
JAA sets mode: +o bakJAA |
03:52
🔗
|
|
bakJAA sets mode: +o JAA |
04:05
🔗
|
|
teej_ is now known as tees |
04:05
🔗
|
|
tees is now known as teej_ |
04:12
🔗
|
|
marked has left |
04:13
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
04:14
🔗
|
|
dashcloud has quit IRC (Ping timeout: 492 seconds) |
04:15
🔗
|
Somebody2 |
There's no "retired". The AutoQueue feature just keeps creating new batches with incremented sequence numbers. Everything else is manual. |
04:16
🔗
|
Somebody2 |
Each batch is sent out to a single warrior, then the tracker waits for a specified amount of time (generally half an hour) for the warrior to report back. |
04:16
🔗
|
|
teej_ is now known as te3j |
04:17
🔗
|
Somebody2 |
If it does, then the batch is done, no matter what the contents of the results are. |
04:17
🔗
|
Somebody2 |
If the warrior times out, or returns an error, the batch is sent out to a different warrior. |
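A minimal sketch of the batch lifecycle Somebody2 describes above, in Python. All names here (`Batch`, `hand_out`, `report`, `needs_reassignment`, `next_warrior`) are hypothetical and are not taken from the tracker code; the 30-minute window is the "generally half an hour" figure from the message above. The point it illustrates is that a report finishes the batch no matter how many shortcodes resolved, while a timeout or an error sends the batch to a different warrior.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Assumption: "generally half an hour", as stated in the chat.
CLAIM_TIMEOUT_SECONDS = 30 * 60


@dataclass
class Batch:
    seq: int                          # sequence number created by the AutoQueue
    claimed_by: Optional[str] = None  # warrior currently holding the batch
    claimed_at: Optional[float] = None
    done: bool = False
    tried: Set[str] = field(default_factory=set)


def hand_out(batch: Batch, warrior: str, now: float) -> None:
    """Give the batch to exactly one warrior and start the clock."""
    batch.claimed_by = warrior
    batch.claimed_at = now
    batch.tried.add(warrior)


def report(batch: Batch, warrior: str, found: int, total: int,
           error: bool = False) -> None:
    """Handle a warrior reporting back.

    `found` and `total` are deliberately ignored: 35/50 and 50/50 are
    treated the same. Only an error releases the batch again.
    """
    if batch.claimed_by != warrior:
        return
    if error:
        batch.claimed_by = None
        batch.claimed_at = None
    else:
        batch.done = True


def needs_reassignment(batch: Batch, now: float) -> bool:
    """True if the batch is unfinished and unclaimed, or its claim timed out."""
    if batch.done:
        return False
    if batch.claimed_by is None:
        return True
    return now - batch.claimed_at > CLAIM_TIMEOUT_SECONDS


def next_warrior(batch: Batch, warriors: list) -> Optional[str]:
    """Pick a warrior that has not held this batch yet, if there is one."""
    for name in warriors:
        if name not in batch.tried:
            return name
    return None
```

Under this model, Flashfire's batch 49 would never be queued again because of what batch 48 returned: each batch is tracked on its own, and re-running an old range is a manual step.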
04:20
🔗
|
|
te3j is now known as teej_____ |
04:21
🔗
|
|
teej_____ is now known as t3 |
04:29
🔗
|
|
odemg has joined #urlteam |
05:45
🔗
|
|
logchfoo0 starts logging #urlteam at Sat Dec 15 05:45:12 2018 |
05:45
🔗
|
|
logchfoo0 has joined #urlteam |
08:21
🔗
|
|
JAA has quit IRC (Ping timeout: 246 seconds) |
08:21
🔗
|
|
svchfoo1 has quit IRC (Ping timeout: 246 seconds) |
08:21
🔗
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
09:20
🔗
|
|
svchfoo1 has joined #urlteam |
09:20
🔗
|
|
JAA has joined #urlteam |
09:20
🔗
|
|
bakJAA sets mode: +o JAA |
09:21
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
09:27
🔗
|
|
odemg has joined #urlteam |
09:27
🔗
|
|
chfoo has quit IRC (Read error: Operation timed out) |
09:30
🔗
|
|
chfoo has joined #urlteam |
10:41
🔗
|
|
VariXx has joined #urlteam |
13:38
🔗
|
|
psi has joined #urlteam |
14:13
🔗
|
|
dashcloud has joined #urlteam |
14:30
🔗
|
|
trvz has joined #urlteam |
14:31
🔗
|
trvz |
Fusl: who says burstable |
14:34
🔗
|
|
dashcloud has quit IRC (Ping timeout: 265 seconds) |
14:35
🔗
|
Fusl |
you used google cloud, right? |
14:35
🔗
|
|
ave_ has joined #urlteam |
14:35
🔗
|
|
yano has joined #urlteam |
14:35
🔗
|
yano |
hai |
14:35
🔗
|
Fusl |
hi |
14:35
🔗
|
|
jut has joined #urlteam |
14:36
🔗
|
|
erine has joined #urlteam |
14:36
🔗
|
Fusl |
let me quickly figure out if this all works properly with the docker warrior |
14:36
🔗
|
yano |
ack |
14:36
🔗
|
Fusl |
if it does, you can run it with for i in $(seq NUM_OF_THREADS); do docker container run -d --rm -e DOWNLOADER=Fusl -e SELECTED_PROJECT=urlteam2 -e CONCURRENT_ITEMS=1 archiveteam/warrior-dockerfile; done |
14:37
🔗
|
Fusl |
seems to work fine |
14:37
🔗
|
Fusl |
happy scraping |
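For anyone who would rather script the launch than type the shell loop, here is a rough Python equivalent of Fusl's command. The image name, flags, and environment variables are copied from the message above; the function names `warrior_command` and `launch_plan` are made up for this sketch. It only builds and prints the commands, so nothing is started by accident.

```python
import shlex


def warrior_command(downloader, project="urlteam2", concurrent_items=1):
    """Build the argv for one warrior container (flags as in the chat)."""
    return [
        "docker", "container", "run", "-d", "--rm",
        "-e", f"DOWNLOADER={downloader}",
        "-e", f"SELECTED_PROJECT={project}",
        "-e", f"CONCURRENT_ITEMS={concurrent_items}",
        "archiveteam/warrior-dockerfile",
    ]


def launch_plan(downloader, containers):
    """Return one shell-quoted command line per container to start."""
    return [shlex.join(warrior_command(downloader)) for _ in range(containers)]


if __name__ == "__main__":
    # Print the plan; to really launch, pass each warrior_command(...) list
    # to subprocess.run(cmd, check=True) instead.
    for line in launch_plan("Fusl", 3):
        print(line)
```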
14:39
🔗
|
|
diggan has joined #urlteam |
14:42
🔗
|
yano |
to install docker on a Debian 9 system: |
14:42
🔗
|
yano |
apt -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common; curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -; add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"; apt update; apt -y install docker-ce |
14:44
🔗
|
Fusl |
actually |
14:44
🔗
|
Fusl |
curl https://get.docker.com | bash |
14:45
🔗
|
yano |
oh |
14:45
🔗
|
yano |
hah |
14:45
🔗
|
yano |
i took my line from the docker docs and just combined their steps into one long one-liner |
14:49
🔗
|
ave_ |
Anyone here using hetzner? Did you get abuse warnings from them? I'll spin up 4-5 more warriors on my hetzner vpses if they don't issue abuse warnings for continuously scraping url shortening sites. |
14:50
🔗
|
yano |
i'm using google cloud platform |
14:58
🔗
|
trvz |
what are limitations here? good concurrency numbers? |
15:01
🔗
|
psi |
you don't want too much concurrency per ip |
15:01
🔗
|
Fusl |
yeah, they tend to blacklist very quickly |
15:01
🔗
|
Fusl |
so as long as you have an unlimited set of ips, everything is well |
15:02
🔗
|
|
coldon2dr has joined #urlteam |
15:03
🔗
|
trvz |
what's too much? |
15:16
🔗
|
Fusl |
that depends on the service |
15:16
🔗
|
Fusl |
bit.ly blocks very fast |
15:16
🔗
|
Fusl |
i had one thread running on a single ip, got banned within a few hours |
15:21
🔗
|
JAA |
Yeah, we should look into increasing the delay on bit.ly probably. |
15:23
🔗
|
JAA |
In general, the tracker only hands out one item per shortener and IP address. So running more than 5 concurrent items on one IP is useless currently (because we have 5 shorteners active right now). |
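A small sketch of the rule JAA describes, to show why extra concurrency on one IP goes unused. This is an illustration only, not the tracker's code: `assign_item` and the shortener names are hypothetical, and the real tracker presumably also frees a slot when an item is returned.

```python
def assign_item(active, ip, shorteners):
    """Hand out at most one item per (IP, shortener) pair.

    `active` is a set of (ip, shortener) pairs that are currently checked out.
    Returns the shortener assigned, or None if this IP already holds an item
    for every active shortener.
    """
    for shortener in shorteners:
        if (ip, shortener) not in active:
            active.add((ip, shortener))
            return shortener
    return None


if __name__ == "__main__":
    # Placeholder names; the chat only says that 5 shorteners were active.
    shorteners = [f"shortener-{n}" for n in range(1, 6)]
    active = set()
    results = [assign_item(active, "203.0.113.7", shorteners) for _ in range(7)]
    print(results)  # the sixth and seventh requests get nothing
```

With 5 active shorteners, the sixth concurrent request from the same IP gets no work, which is the ceiling JAA mentions; a second IP starts again from zero.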
16:20
🔗
|
Fusl |
why is it not more than 5 active tho? |
16:28
🔗
|
|
odemg has quit IRC (Read error: Connection reset by peer) |
16:53
🔗
|
JAA |
Because no one configured more shorteners. |
16:59
🔗
|
|
LFlare43 has joined #urlteam |
16:59
🔗
|
|
LFlare43 is now known as LFlare |
18:15
🔗
|
|
odemg has joined #urlteam |
18:16
🔗
|
|
odemg_ has joined #urlteam |
18:16
🔗
|
|
odemg_ has quit IRC (Remote host closed the connection) |
18:42
🔗
|
|
n00b_42 has joined #urlteam |
18:46
🔗
|
Somebody2 |
If someone wants to dig into a fix for this error, that would be VERY APPRECIATED: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 7: invalid continuation byte" |
18:46
🔗
|
Somebody2 |
Here's the rest of the traceback: https://hastebin.com/wiqafihadi.sql |
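The traceback itself is not reproduced in this log, so the following is only a guess at the class of problem: bytes coming back from a shortener (a redirect target, say) that are not valid UTF-8. The sample bytes are invented to trigger the same message as the one quoted above, and `decode_tolerantly` is a hypothetical helper, not a function from terroroftinytown.

```python
def decode_tolerantly(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 when the bytes are not UTF-8.

    Latin-1 maps every byte to a code point, so the fallback cannot raise.
    Whether Latin-1 is the right guess for a given shortener is an assumption.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Made-up example: 0xdd at position 7 starts a two-byte UTF-8 sequence,
# but the next byte ("a") is not a continuation byte.
sample = b"http://\xddabc"

if __name__ == "__main__":
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)
    print(decode_tolerantly(sample))
```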
18:47
🔗
|
|
n00b_42 has quit IRC (Quit: http://chat.efnet.org (Ping timeout)) |
18:47
🔗
|
Somebody2 |
I'll see about spinning up a bunch more projects to make use of all the warriors we've got now. |
18:49
🔗
|
Somebody2 |
turning on xco to see what happens |
18:51
🔗
|
Somebody2 |
and waa-ai |
18:52
🔗
|
Somebody2 |
and pub-vitrue-com |
18:52
🔗
|
Somebody2 |
probably most of these won't actually get us useful data, but WTH... |
18:54
🔗
|
Somebody2 |
and u4-hu |
18:55
🔗
|
Somebody2 |
and tighturl-com |
18:56
🔗
|
Somebody2 |
turned isgd_6 back on, I think the warriors have been updated to handle HTTPS |
19:03
🔗
|
Somebody2 |
turned on shar-es too |
19:03
🔗
|
Somebody2 |
there are a BUNCH of nice features I'd like to add to the admin interface of urlteam... |
19:04
🔗
|
Somebody2 |
like a *notes* field built into the admin interface |
19:04
🔗
|
Somebody2 |
rather than having to check back on the wiki |
19:04
🔗
|
Somebody2 |
and (if I'm dreaming) various summary info about which shortcode ranges we've already tried, and when |
19:04
🔗
|
Somebody2 |
if someone who codes in Python wanted to help... |
19:06
🔗
|
Somebody2 |
and turned on trap-it |
19:07
🔗
|
Somebody2 |
So far, waa-ai, u4-hu, and tighturl-com have returned no results. |
19:08
🔗
|
Somebody2 |
and u4-hu is throwing some errors, so I'm turning it off |
19:10
🔗
|
Somebody2 |
so is waa-ai, so it's going off, too |
19:12
🔗
|
Somebody2 |
trap-it is returning 301s that we didn't expect; now fixing that, and it should start generating results |
19:13
🔗
|
Somebody2 |
added vbly-us |
19:23
🔗
|
Somebody2 |
turning on mysp-ac |
19:25
🔗
|
Somebody2 |
trap-it seems to just redirect all requests to scribblelive.com |
19:25
🔗
|
|
mtntmnky has joined #urlteam |
19:25
🔗
|
Somebody2 |
turning it off |
19:27
🔗
|
Somebody2 |
OK, we're up to around 705 scans/sec -- I'm going to step away for a while. If/when everything blows up, feel free to ping me. |
19:27
🔗
|
Somebody2 |
mysp-ac is returning results |
19:27
🔗
|
psi |
Seeing some of these `Error communicating with tracker: HTTPConnectionPool(host='tracker.archiveteam.org', port=1337): Read timed out. (read timeout=60).` when uploading |
19:28
🔗
|
Somebody2 |
yep, that happens |
19:28
🔗
|
Somebody2 |
we're pushing the tracker hard |
19:28
🔗
|
psi |
ah :P |
19:42
🔗
|
kpcyrd |
Somebody2: I didn't read the whole backlog, but are you on linux, are locales configured and $LANG set correctly? |
19:42
🔗
|
kpcyrd |
that's not the case in docker for example and python doesn't really like that for some reason |
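One quick way to check kpcyrd's theory on a given warrior is to print what Python thinks the environment's encodings are. This uses only the standard library; `encoding_report` is a name made up for this sketch, and whether the locale is actually behind the error above is not established in this log.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the locale-related settings Python will use for text I/O."""
    return {
        "LANG": os.environ.get("LANG"),      # often unset inside containers
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```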
19:44
🔗
|
Somebody2 |
kpcyrd: that is certainly possible! |
19:44
🔗
|
Somebody2 |
This is not on my local box, but on the URLTeam client code running on the warriors. |
19:44
🔗
|
Somebody2 |
The repo that you'd be looking at is: https://github.com/ArchiveTeam/terroroftinytown |
19:45
🔗
|
Somebody2 |
let me know if/how else I can help you get started investigating it more (should you wish to) |
19:46
🔗
|
kpcyrd |
I'm having trouble loading the traceback for some reason |
19:50
🔗
|
kpcyrd |
Somebody2: does your hastebin link work for you? It seems there's an issue with their server right now |
19:52
🔗
|
|
pnJay has joined #urlteam |
19:53
🔗
|
JAA |
Doesn't work for me either. |
19:56
🔗
|
pnJay |
how many concurrent per ip for this project? |
19:58
🔗
|
Somebody2 |
hm, let me put it somewhere else |
19:58
🔗
|
Somebody2 |
pnJay: doesn't matter |
19:58
🔗
|
Somebody2 |
as many or as few as you wish |
19:58
🔗
|
Somebody2 |
more than 12 is useless |
19:58
🔗
|
pnJay |
12 was the answer I was looking for, I guess. Thanks! |
19:59
🔗
|
Somebody2 |
I can raise it higher by turning on more projects |
20:00
🔗
|
Somebody2 |
does this work for the traceback? https://0bin.net/paste/Jaxv8+B3BSnKad0w#-+R683qB8HzivzsboLxiIAc1WyxTw+6nsH7ZaIsYr3j |
20:05
🔗
|
pnJay |
Will I break anything or piss anyone off if I point a buncha warriors at this? |
20:08
🔗
|
kpcyrd |
Somebody2: works now, does this happen for every url or just a few? |
20:11
🔗
|
kpcyrd |
Somebody2: and do you have the url that was requested in the logs? |
20:13
🔗
|
Somebody2 |
intermittently |
20:14
🔗
|
Somebody2 |
here's the range of URLs -- it happened for one of them, I can't tell exactly which |
20:17
🔗
|
Somebody2 |
one thing I could do is make the ranges smaller, temporarily, to make it easier to figure out which ones are borking |
20:18
🔗
|
Somebody2 |
oh, and I know it's intermittent, because if we retry the same set of URLs enough times, I don't get the error |
20:18
🔗
|
Somebody2 |
OK, here's a range that it happened in: 38gr1cn TO 38gr1db |
20:19
🔗
|
Somebody2 |
that's 50 shortcodes |
20:19
🔗
|
Somebody2 |
cn, co, cp, cq ... d8, d9, da, db |
20:20
🔗
|
Somebody2 |
and this is for tinyurl.com |
20:20
🔗
|
Somebody2 |
whoops, sorry, 25, not 50 |
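To list every shortcode in that range (for example, to test them one at a time), the sequence Somebody2 spells out is consistent with a base-36 counter over digits followed by lowercase letters, since d9 is followed by da. That alphabet is inferred from the listing and not confirmed against the tinyurl project settings, and the helper names are hypothetical.

```python
# Assumed alphabet, inferred from "... d8, d9, da, db" in the chat.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(number, width):
    """Render a non-negative integer in base 36, left-padded to `width`."""
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)).rjust(width, "0")


def shortcode_range(first, last):
    """Return all shortcodes from `first` to `last`, inclusive."""
    start, end = int(first, 36), int(last, 36)
    return [to_base36(n, len(first)) for n in range(start, end + 1)]


if __name__ == "__main__":
    codes = shortcode_range("38gr1cn", "38gr1db")
    print(len(codes))
    print(codes[0], codes[-1])
```

Under that assumption the range holds 25 codes, which matches the corrected count.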
20:21
🔗
|
Somebody2 |
pnJay: nope, you are welcome to point a bunch of Warriors at this -- they just may not get very well used! |
20:28
🔗
|
|
caff_ has joined #urlteam |
20:32
🔗
|
|
caff has quit IRC (Read error: Operation timed out) |
20:42
🔗
|
psi |
i have 12 concs going at this now |
20:42
🔗
|
psi |
on 2 hosts |
20:49
🔗
|
|
hook54321 has quit IRC (Quit: Connection closed for inactivity) |
22:18
🔗
|
erine |
ave_: Wait, really? |
22:19
🔗
|
erine |
How long does it take to get the notices |
22:57
🔗
|
Somebody2 |
erine: what? |
22:57
🔗
|
erine |
Hetzner warning notices? |
22:58
🔗
|
Somebody2 |
Ah, I don't know anything about those. |
23:28
🔗
|
|
hook54321 has joined #urlteam |