[02:36] *** VariXx has quit IRC (Quit: moo)
[03:05] Flashfire: I don't understand your question? It doesn't matter how many shortcodes in a set return results, the AutoQueue treats it the same.
[03:07] I mean for the non-incremental batches. For example, if batch 48 returns 35/50 successful but batch 49 returns 50/50 successful, will batch 49 be queued again when batch 48 is tried again at a later date?
[03:08] Somebody2 does that explain it any more?
[03:25] The non-incremental ones are just using a hash to convert the seq#s to shortcodes -- it's otherwise the same,
[03:26] so it doesn't do any tracking of what gets returned
[03:26] just that it got back info about the batch
[03:26] does that answer your question?
[03:27] I think there may still be some mutual misunderstanding going on
[03:32] but Somebody2 if a batch returns all successful codes is that batch retired as such?
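A minimal sketch of the idea described above: sequence numbers map straight to shortcodes for incremental projects, and go through a hash first for non-incremental ones, with no per-shortcode result tracking. The base-36 alphabet, the SHA-256 keyed hash, the 7-character length, and the function names are all assumptions for illustration; the actual terroroftinytown code may do this differently.

    import hashlib

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed shortcode alphabet

    def seq_to_shortcode(seq, length=7):
        """Incremental projects: plain base conversion of the sequence number."""
        code = ""
        while seq:
            seq, rem = divmod(seq, len(ALPHABET))
            code = ALPHABET[rem] + code
        return code.rjust(length, ALPHABET[0])

    def seq_to_shortcode_hashed(seq, key=b"example-key", length=7):
        """Non-incremental projects: hash the sequence number first, so the
        shortcodes a batch visits are scattered rather than consecutive."""
        digest = hashlib.sha256(key + str(seq).encode()).digest()
        n = int.from_bytes(digest, "big") % (len(ALPHABET) ** length)
        return seq_to_shortcode(n, length)

    # A batch is just a contiguous run of sequence numbers; the tracker only
    # records that the batch came back, not which individual codes resolved.
    print([seq_to_shortcode_hashed(seq) for seq in range(48 * 50, 48 * 50 + 5)])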
[03:50] *** kiska1 has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** wmvhater has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Mayonaise has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** TigerbotH has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** treora has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** BnAboyZ has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Kaz has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Jusque has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** teej_ has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** chr1sm has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** HCross has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** voltagex_ has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** aard has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** bakJAA has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Ctrl-S_ has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** deathy has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** eLbot has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** kpcyrd has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** buckket has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Hecatz has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** zhongfu has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Muad-Dib has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** moufu has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Frogging has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** jornane has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** hook54321 has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** mr_archiv has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** joepie91 has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** z00nx has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** w0rmhole has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** kiska has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Flashfire has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** dashcloud has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** Boppen has quit IRC (hub.efnet.us ny.us.hub)
[03:50] *** dashcloud has joined #urlteam
[03:50] *** kiska1 has joined #urlteam
[03:50] *** wmvhater has joined #urlteam
[03:50] *** moufu has joined #urlteam
[03:50] *** Jusque has joined #urlteam
[03:50] *** Frogging has joined #urlteam
[03:50] *** BnAboyZ has joined #urlteam
[03:50] *** treora has joined #urlteam
[03:50] *** teej_ has joined #urlteam
[03:50] *** chr1sm has joined #urlteam
[03:50] *** HCross has joined #urlteam
[03:50] *** voltagex_ has joined #urlteam
[03:50] *** Mayonaise has joined #urlteam
[03:50] *** buckket has joined #urlteam
[03:50] *** Boppen has joined #urlteam
[03:50] *** jornane has joined #urlteam
[03:50] *** hook54321 has joined #urlteam
[03:50] *** TigerbotH has joined #urlteam
[03:50] *** aard has joined #urlteam
[03:50] *** bakJAA has joined #urlteam
[03:50] *** Ctrl-S_ has joined #urlteam
[03:50] *** deathy has joined #urlteam
[03:50] *** Kaz has joined #urlteam
[03:50] *** eLbot has joined #urlteam
[03:50] *** kpcyrd has joined #urlteam
[03:50] *** Hecatz has joined #urlteam
[03:50] *** zhongfu has joined #urlteam
[03:50] *** Muad-Dib has joined #urlteam
[03:50] *** mr_archiv has joined #urlteam
[03:50] *** joepie91 has joined #urlteam
[03:50] *** z00nx has joined #urlteam
[03:50] *** w0rmhole has joined #urlteam
[03:50] *** kiska has joined #urlteam
[03:50] *** Flashfire has joined #urlteam
[03:50] *** ny.us.hub sets mode: +oo HCross bakJAA
[03:51] *** JAA sets mode: +o bakJAA
[03:52] *** bakJAA sets mode: +o JAA
[04:05] *** teej_ is now known as tees
[04:05] *** tees is now known as teej_
[04:12] *** marked has left
[04:13] *** odemg has quit IRC (Read error: Operation timed out)
[04:14] *** dashcloud has quit IRC (Ping timeout: 492 seconds)
[04:15] There's no "retired". The AutoQueue feature just keeps creating new batches with incremented sequence numbers. Everything else is manual.
[04:16] Each batch is sent out to a single warrior, then the tracker waits for a specified amount of time (generally half an hour) for the warrior to report back.
[04:16] *** teej_ is now known as te3j
[04:17] If it does, then the batch is done, no matter what the contents of the results are.
[04:17] If the warrior times out, or returns an error, the batch is sent out to a different warrior.
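A toy sketch of the batch rule just described: any report within the claim window retires the batch, while a timeout or error sends the same batch to a different warrior. The names handle_batch and send_to_warrior are illustrative, not the real tracker API.

    import random

    def handle_batch(batch_number, send_to_warrior):
        """Toy outline of the rule above: the tracker waits (generally half an
        hour) for a claimed batch; any report retires it, while a timeout or
        error hands the same batch to a different warrior.
        send_to_warrior(batch_number) returns a results dict, or None on
        timeout/error -- both names are assumptions for illustration."""
        while True:
            results = send_to_warrior(batch_number)
            if results is not None:
                # Done no matter what the results contain -- even 0 successful
                # shortcodes counts as a completed batch.
                return results
            # Otherwise the same batch goes out to a different warrior.

    # Toy usage: a fake warrior that times out half the time.
    fake_warrior = lambda n: None if random.random() < 0.5 else {"batch": n, "found": 35}
    print(handle_batch(48, fake_warrior))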
[04:20] *** te3j is now known as teej_____
[04:21] *** teej_____ is now known as t3
[04:29] *** odemg has joined #urlteam
[05:45] *** logchfoo0 starts logging #urlteam at Sat Dec 15 05:45:12 2018
[05:45] *** logchfoo0 has joined #urlteam
[08:21] *** JAA has quit IRC (Ping timeout: 246 seconds)
[08:21] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
[08:21] *** odemg has quit IRC (Ping timeout: 246 seconds)
[09:20] *** svchfoo1 has joined #urlteam
[09:20] *** JAA has joined #urlteam
[09:20] *** bakJAA sets mode: +o JAA
[09:21] *** svchfoo3 sets mode: +o svchfoo1
[09:27] *** odemg has joined #urlteam
[09:27] *** chfoo has quit IRC (Read error: Operation timed out)
[09:30] *** chfoo has joined #urlteam
[10:41] *** VariXx has joined #urlteam
[13:38] *** psi has joined #urlteam
[14:13] *** dashcloud has joined #urlteam
[14:30] *** trvz has joined #urlteam
[14:31] Fusl: who says burstable
[14:34] *** dashcloud has quit IRC (Ping timeout: 265 seconds)
[14:35] you used google cloud, right?
[14:35] *** ave_ has joined #urlteam
[14:35] *** yano has joined #urlteam
[14:35] hai
[14:35] hi
[14:35] *** jut has joined #urlteam
[14:36] *** erine has joined #urlteam
[14:36] let me quickly figure out if this all works properly with the docker warrior
[14:36] ack
[14:36] if it does, you can run it with for i in $(seq NUM_OF_THREADS); do docker container run -d --rm -e DOWNLOADER=Fusl -e SELECTED_PROJECT=urlteam2 -e CONCURRENT_ITEMS=1 archiveteam/warrior-dockerfile; done
[14:37] seems to work fine
[14:37] happy scraping
[14:39] *** diggan has joined #urlteam
[14:42] to install docker on a Debian 9 system:
[14:42] apt -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common; curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -; add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"; apt update; apt -y install docker-ce
[14:44] actually
[14:44] curl https://get.docker.com | bash
[14:45] oh
[14:45] hah
[14:45] i took my line from the docker docs and just combined their steps into one long one-liner
[14:49] Anyone here using hetzner? Did you get abuse warnings from them? I'll spin up 4-5 more warriors on my hetzner vpses if they don't issue abuse warnings for continuously scraping url shortening sites.
[14:50] i'm using google cloud platform
[14:58] what are the limitations here? good concurrency numbers?
[15:01] you don't want too much concurrency per ip
[15:01] yeah, they tend to blacklist very quickly
[15:01] so as long as you have an unlimited set of ips, everything is well
[15:02] *** coldon2dr has joined #urlteam
[15:03] what's too much?
[15:16] that depends on the service
[15:16] bit.ly blocks very fast
[15:16] i had one thread running on a single ip, got banned within a few hours
[15:21] Yeah, we should look into increasing the delay on bit.ly probably.
[15:23] In general, the tracker only hands out one item per shortener and IP address. So running more than 5 concurrency on one IP is useless currently (because we have 5 shorteners active right now).
[16:20] why is it not more than 5 active tho?
[16:28] *** odemg has quit IRC (Read error: Connection reset by peer)
[16:53] Because no one configured more shorteners.
[16:59] *** LFlare43 has joined #urlteam
[16:59] *** LFlare43 is now known as LFlare
[18:15] *** odemg has joined #urlteam
[18:16] *** odemg_ has joined #urlteam
[18:16] *** odemg_ has quit IRC (Remote host closed the connection)
[18:42] *** n00b_42 has joined #urlteam
[18:46] If someone wants to dig into a fix for this error, that would be VERY APPRECIATED: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 7: invalid continuation byte"
[18:46] Here's the rest of the traceback: https://hastebin.com/wiqafihadi.sql
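The UnicodeDecodeError above is the kind of crash that tolerant decoding can avoid. A minimal sketch, assuming the failing spot is somewhere in the client decoding response bytes as UTF-8 (the exact frame is in the traceback linked above); safe_decode and the Latin-1 fallback are illustrative, not part of the actual terroroftinytown code.

    def safe_decode(raw: bytes) -> str:
        """Decode shortener responses that are usually UTF-8 but occasionally
        contain stray bytes (such as 0xdd) from legacy encodings."""
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Latin-1 maps every byte to some character, so the item can still
            # be reported instead of crashing the warrior.
            return raw.decode("latin-1")

    # Example: the byte 0xdd is not a valid UTF-8 continuation byte.
    print(safe_decode(b"http://\xddexample"))  # 'http://Ýexample'

Passing errors="replace" to the decode call would be an alternative if preserving the exact non-UTF-8 bytes doesn't matter.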
[18:47] *** n00b_42 has quit IRC (Quit: http://chat.efnet.org (Ping timeout))
[18:47] I'll see about spinning up a bunch more projects to make use of all the warriors we've got now.
[18:49] turning on xco to see what happens
[18:51] and waa-ai
[18:52] and pub-vitrue-com
[18:52] probably most of these won't actually get us useful data, but WTH...
[18:54] and u4-hu
[18:55] and tighturl-com
[18:56] turned isgd_6 back on, I think the warriors have been updated to handle HTTPS
[19:03] turned on shar-es too
[19:03] there are a BUNCH of nice features I'd like to add to the admin interface of urlteam...
[19:04] like a *notes* field built into the admin interface
[19:04] rather than having to check back on the wiki
[19:04] and (if I'm dreaming) various summary info about which shortcode ranges we've already tried, and when
[19:04] if someone who codes in Python wanted to help...
[19:06] and turned on trap-it
[19:07] So far, waa-ai, u4-hu, and tighturl-com have returned no results.
[19:08] and u4-hu is throwing some errors, so I'm turning it off
[19:10] so is waa-ai, so it's going off, too
[19:12] trap-it is returning 301s that we didn't expect; now fixing that, and it should start generating results
[19:13] added vbly-us
[19:23] turning on mysp-ac
[19:25] trap-it seems to just redirect all requests to scribblelive.com
[19:25] *** mtntmnky has joined #urlteam
[19:25] turning it off
[19:27] OK, we're up to around 705 scans/sec -- I'm going to step away for a while. If/when everything blows up, feel free to ping me.
[19:27] mysp-ac is returning results
[19:27] Seeing some of these `Error communicating with tracker: HTTPConnectionPool(host='tracker.archiveteam.org', port=1337): Read timed out. (read timeout=60).` when uploading
[19:28] yep, that happens
[19:28] we're pushing the tracker hard
[19:28] ah :P
[19:42] Somebody2: I didn't read the whole backlog, but are you on linux, are locales configured and $LANG set correctly?
[19:42] that's not the case in docker for example and python doesn't really like that for some reason
[19:44] kpcyrd: that is certainly possible!
[19:44] This is not on my local box, but on the URLTeam client code running on the warriors.
[19:44] The repo that you'd be looking at is: https://github.com/ArchiveTeam/terroroftinytown
[19:45] let me know if/how else I can help you get started investigating it more (should you wish to)
[19:46] I'm having trouble loading the traceback for some reason
[19:50] Somebody2: does your hastebin link work for you? It seems there's an issue with their server right now
[19:52] *** pnJay has joined #urlteam
[19:53] Doesn't work for me either.
[19:56] how many concurrent per ip for this project?
[19:58] hm, let me put it somewhere else
[19:58] pnJay: doesn't matter
[19:58] as many or as few as you wish
[19:58] more than 12 is useless
[19:58] 12 was the answer I was looking for, I guess. Thanks!
[19:59] I can raise it higher by turning on more projects
[20:00] does this work for the traceback? https://0bin.net/paste/Jaxv8+B3BSnKad0w#-+R683qB8HzivzsboLxiIAc1WyxTw+6nsH7ZaIsYr3j
[20:05] Will I break anything or piss anyone off if I point a buncha warriors at this?
[20:08] Somebody2: works now, does this happen for every url or just a few?
[20:11] Somebody2: and do you have the url that was requested in the logs?
[20:13] intermittently
[20:14] here's the range of URLs -- it happened for one of them, I can't tell exactly which
[20:17] one thing I could do is make the ranges smaller, temporarily, to make it easier to figure out which ones are borking
[20:18] oh, and I know it's intermittent, because if we retry the same set of URLs enough times, I don't get the error
[20:18] OK, here's a range that it happened in: 38gr1cn TO 38gr1db
[20:19] that's 50 shortcodes
[20:19] cn, co, cp, cq ... d8, d9, da, db
[20:20] and this is for tinyurl.com
[20:20] whoops, sorry, 25, not 50
[20:21] pnJay: nope, you are welcome to point a bunch of Warriors at this -- they just may not get very well used!
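One way to pin down which shortcode in the 38gr1cn..38gr1db range trips the decode error is to enumerate the range and inspect each redirect target by hand. A rough sketch, assuming tinyurl.com codes use the base-36 alphabet 0-9a-z and using the requests library rather than the real terroroftinytown client (mind the per-IP rate limits discussed above).

    import requests

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def code_to_int(code):
        n = 0
        for ch in code:
            n = n * len(ALPHABET) + ALPHABET.index(ch)
        return n

    def int_to_code(n, length):
        code = ""
        while n:
            n, rem = divmod(n, len(ALPHABET))
            code = ALPHABET[rem] + code
        return code.rjust(length, ALPHABET[0])

    def codes_in_range(first, last):
        for n in range(code_to_int(first), code_to_int(last) + 1):
            yield int_to_code(n, len(first))

    # The 25 codes 38gr1cn .. 38gr1db: fetch each one and report which
    # redirect targets fail to decode as UTF-8.
    for code in codes_in_range("38gr1cn", "38gr1db"):
        resp = requests.get("https://tinyurl.com/" + code, allow_redirects=False)
        location = resp.headers.get("Location", "")
        # HTTP headers are decoded as Latin-1, so re-encode to get the raw bytes.
        raw = location.encode("latin-1")
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            print(code, exc)

This follows the same idea as Somebody2's suggestion of shrinking the ranges: the smaller the set of codes under test, the easier it is to see which one is borking.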
[20:28] *** caff_ has joined #urlteam
[20:32] *** caff has quit IRC (Read error: Operation timed out)
[20:42] i have 12 concs going at this now
[20:42] on 2 hosts
[20:49] *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
[22:18] ave_: Wait, really?
[22:19] How long does it take to get the notices
[22:57] erine: what?
[22:57] Hetzner warning notices?
[22:58] Ah, I don't know anything about those.
[23:28] *** hook54321 has joined #urlteam