#urlteam 2018-12-15,Sat

Time Nickname Message
02:36 🔗 VariXx has quit IRC (Quit: moo)
03:05 🔗 Somebody2 Flashfire: I don't understand your question? It doesn't matter how many shortcodes in a set return results, the AutoQueue treats it the same.
03:07 🔗 Flashfire I mean for the non incremental batches. For example if batch 48 returns 35/50 Successful but batch 49 returns 50/50 successful will batch 49 be queued again when batch 48 is tried again at a later date
03:08 🔗 Flashfire Somebody2 does that explain it any more
03:25 🔗 Somebody2 The non-incremental ones are just using a hash to convert the seq#s to shortcodes -- it's otherwise the same,
03:26 🔗 Somebody2 so it doesn't do any tracking of what gets returned
03:26 🔗 Somebody2 just that it got back info about the batch
03:26 🔗 Somebody2 does that answer your question?
03:27 🔗 Somebody2 I think there may still be some mutual misunderstanding going on
03:32 🔗 Flashfire but Somebody2 if a batch returns all successful codes is that batch retired as such
03:50 🔗 kiska1 has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 wmvhater has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Mayonaise has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 TigerbotH has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 treora has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 BnAboyZ has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Kaz has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Jusque has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 teej_ has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 chr1sm has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 HCross has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 voltagex_ has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 aard has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 bakJAA has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Ctrl-S_ has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 deathy has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 eLbot has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 kpcyrd has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 buckket has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Hecatz has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 zhongfu has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Muad-Dib has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 moufu has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Frogging has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 jornane has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 hook54321 has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 mr_archiv has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 joepie91 has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 z00nx has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 w0rmhole has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 kiska has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Flashfire has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 dashcloud has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 Boppen has quit IRC (hub.efnet.us ny.us.hub)
03:50 🔗 dashcloud has joined #urlteam
03:50 🔗 kiska1 has joined #urlteam
03:50 🔗 wmvhater has joined #urlteam
03:50 🔗 moufu has joined #urlteam
03:50 🔗 Jusque has joined #urlteam
03:50 🔗 Frogging has joined #urlteam
03:50 🔗 BnAboyZ has joined #urlteam
03:50 🔗 treora has joined #urlteam
03:50 🔗 teej_ has joined #urlteam
03:50 🔗 chr1sm has joined #urlteam
03:50 🔗 HCross has joined #urlteam
03:50 🔗 voltagex_ has joined #urlteam
03:50 🔗 Mayonaise has joined #urlteam
03:50 🔗 buckket has joined #urlteam
03:50 🔗 Boppen has joined #urlteam
03:50 🔗 jornane has joined #urlteam
03:50 🔗 hook54321 has joined #urlteam
03:50 🔗 TigerbotH has joined #urlteam
03:50 🔗 aard has joined #urlteam
03:50 🔗 bakJAA has joined #urlteam
03:50 🔗 Ctrl-S_ has joined #urlteam
03:50 🔗 deathy has joined #urlteam
03:50 🔗 Kaz has joined #urlteam
03:50 🔗 eLbot has joined #urlteam
03:50 🔗 kpcyrd has joined #urlteam
03:50 🔗 Hecatz has joined #urlteam
03:50 🔗 zhongfu has joined #urlteam
03:50 🔗 Muad-Dib has joined #urlteam
03:50 🔗 mr_archiv has joined #urlteam
03:50 🔗 joepie91 has joined #urlteam
03:50 🔗 z00nx has joined #urlteam
03:50 🔗 w0rmhole has joined #urlteam
03:50 🔗 kiska has joined #urlteam
03:50 🔗 Flashfire has joined #urlteam
03:50 🔗 ny.us.hub sets mode: +oo HCross bakJAA
03:51 🔗 JAA sets mode: +o bakJAA
03:52 🔗 bakJAA sets mode: +o JAA
04:05 🔗 teej_ is now known as tees
04:05 🔗 tees is now known as teej_
04:12 🔗 marked has left
04:13 🔗 odemg has quit IRC (Read error: Operation timed out)
04:14 🔗 dashcloud has quit IRC (Ping timeout: 492 seconds)
04:15 🔗 Somebody2 There's no "retired". The AutoQueue feature just keeps creating new batches with incremented sequence numbers. Everything else is manual.
04:16 🔗 Somebody2 Each batch is sent out to a single warrior, then the tracker waits for a specified amount of time (generally half an hour) for the warrior to report back.
04:16 🔗 teej_ is now known as te3j
04:17 🔗 Somebody2 If it does, then the batch is done, no matter what the contents of the results are.
04:17 🔗 Somebody2 If the warrior times out, or returns an error, the batch is sent out to a different warrior.
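[Editor's sketch] The dispatch rule Somebody2 describes above could be modelled roughly as follows. This is a minimal illustration, not the tracker's actual code: `Batch`, `dispatch`, and `fake_run` are hypothetical names, and the timeout value is just the "generally half an hour" figure from the log.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

TIMEOUT_SECONDS = 30 * 60  # "generally half an hour", per the log


@dataclass
class Batch:
    seq: int                 # AutoQueue only ever increments this number
    done: bool = False
    tried: List[str] = field(default_factory=list)


def dispatch(batch: Batch, warriors: List[str],
             run: Callable[[str, Batch, int], Optional[dict]]) -> Optional[dict]:
    """Send the batch to one warrior at a time until one reports back.

    `run` is a hypothetical callable that returns a result dict, or None
    when the warrior times out or returns an error.
    """
    for warrior in warriors:
        batch.tried.append(warrior)
        result = run(warrior, batch, TIMEOUT_SECONDS)
        if result is not None:
            # The batch is done no matter how many shortcodes resolved.
            batch.done = True
            return result
    return None


def fake_run(warrior: str, batch: Batch, timeout: int) -> Optional[dict]:
    # Stand-in for a real warrior: "w1" never reports back, the others do.
    if warrior == "w1":
        return None
    return {"found": 35, "total": 50}


batch = Batch(seq=48)
result = dispatch(batch, ["w1", "w2", "w3"], fake_run)
print(batch.done, batch.tried, result)
```

In this model a batch with 35/50 hits and one with 50/50 hits end up in the same state, which is the point Somebody2 makes: nothing tracks the contents of the results, and nothing is re-queued unless someone does it manually.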
04:20 🔗 te3j is now known as teej_____
04:21 🔗 teej_____ is now known as t3
04:29 🔗 odemg has joined #urlteam
05:45 🔗 logchfoo0 starts logging #urlteam at Sat Dec 15 05:45:12 2018
05:45 🔗 logchfoo0 has joined #urlteam
08:21 🔗 JAA has quit IRC (Ping timeout: 246 seconds)
08:21 🔗 svchfoo1 has quit IRC (Ping timeout: 246 seconds)
08:21 🔗 odemg has quit IRC (Ping timeout: 246 seconds)
09:20 🔗 svchfoo1 has joined #urlteam
09:20 🔗 JAA has joined #urlteam
09:20 🔗 bakJAA sets mode: +o JAA
09:21 🔗 svchfoo3 sets mode: +o svchfoo1
09:27 🔗 odemg has joined #urlteam
09:27 🔗 chfoo has quit IRC (Read error: Operation timed out)
09:30 🔗 chfoo has joined #urlteam
10:41 🔗 VariXx has joined #urlteam
13:38 🔗 psi has joined #urlteam
14:13 🔗 dashcloud has joined #urlteam
14:30 🔗 trvz has joined #urlteam
14:31 🔗 trvz Fusl: who says burstable
14:34 🔗 dashcloud has quit IRC (Ping timeout: 265 seconds)
14:35 🔗 Fusl you used google cloud, right?
14:35 🔗 ave_ has joined #urlteam
14:35 🔗 yano has joined #urlteam
14:35 🔗 yano hai
14:35 🔗 Fusl hi
14:35 🔗 jut has joined #urlteam
14:36 🔗 erine has joined #urlteam
14:36 🔗 Fusl let me quickly figure out if this all works properly with the docker warrior
14:36 🔗 yano ack
14:36 🔗 Fusl if it does, you can run it with for i in $(seq NUM_OF_THREADS); do docker container run -d --rm -e DOWNLOADER=Fusl -e SELECTED_PROJECT=urlteam2 -e CONCURRENT_ITEMS=1 archiveteam/warrior-dockerfile; done
14:37 🔗 Fusl seems to work fine
14:37 🔗 Fusl happy scraping
14:39 🔗 diggan has joined #urlteam
14:42 🔗 yano to install docker on a Debian 9 system:
14:42 🔗 yano apt -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common; curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -; add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"; apt update; apt -y install docker-ce
14:44 🔗 Fusl actually
14:44 🔗 Fusl curl https://get.docker.com | bash
14:45 🔗 yano oh
14:45 🔗 yano hah
14:45 🔗 yano i took my line from the docker docs and just combined their steps into one long one-liner
14:49 🔗 ave_ Anyone here using hetzner? Did you get abuse warnings from them? I'll spin up 4-5 more warriors on my hetzner vpses if they don't issue abuse warnings for continuously scraping url shortening sites.
14:50 🔗 yano i'm using google cloud platform
14:58 🔗 trvz what are limitations here? good concurrency numbers?
15:01 🔗 psi you don't want too much concurrency per ip
15:01 🔗 Fusl yeah, they tend to blacklist very quickly
15:01 🔗 Fusl so as long as you have an unlimited set of ips, everything is well
15:02 🔗 coldon2dr has joined #urlteam
15:03 🔗 trvz what's too much?
15:16 🔗 Fusl that depends on the service
15:16 🔗 Fusl bit.ly blocks very fast
15:16 🔗 Fusl i had one thread running on a single ip, got banned within a few hours
15:21 🔗 JAA Yeah, we should look into increasing the delay on bit.ly probably.
15:23 🔗 JAA In general, the tracker only hands out one item per shortener and IP address. So running more than 5 concurrency on one IP is useless currently (because we have 5 shorteners active right now).
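[Editor's sketch] A toy model of the rule JAA states above, showing why extra concurrency on one IP goes unused. The function `next_item` and the shortener names are hypothetical placeholders; the real tracker logic may differ in detail.

```python
from typing import List, Optional, Set, Tuple

# Placeholder names; the log only says that five shorteners were active.
ACTIVE_SHORTENERS = ["shortener-a", "shortener-b", "shortener-c",
                     "shortener-d", "shortener-e"]


def next_item(ip: str, active: List[str],
              in_flight: Set[Tuple[str, str]]) -> Optional[str]:
    """Hand out at most one in-flight item per (IP, shortener) pair."""
    for shortener in active:
        if (ip, shortener) not in in_flight:
            in_flight.add((ip, shortener))
            return shortener
    return None  # this IP already holds one item for every active shortener


in_flight: Set[Tuple[str, str]] = set()
# Eight concurrent workers on the same IP all ask for work at once.
handed_out = [next_item("198.51.100.7", ACTIVE_SHORTENERS, in_flight)
              for _ in range(8)]
busy = [item for item in handed_out if item is not None]
idle = handed_out.count(None)
print(len(busy), "workers got an item;", idle, "stayed idle")
```

Under this model the useful concurrency per IP is capped at the number of active shorteners, so the cap rises only when more shorteners are configured.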
16:20 🔗 Fusl why is it not more than 5 active tho?
16:28 🔗 odemg has quit IRC (Read error: Connection reset by peer)
16:53 🔗 JAA Because no one configured more shorteners.
16:59 🔗 LFlare43 has joined #urlteam
16:59 🔗 LFlare43 is now known as LFlare
18:15 🔗 odemg has joined #urlteam
18:16 🔗 odemg_ has joined #urlteam
18:16 🔗 odemg_ has quit IRC (Remote host closed the connection)
18:42 🔗 n00b_42 has joined #urlteam
18:46 🔗 Somebody2 If someone wants to dig into a fix for this error, that would be VERY APPRECIATED: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 7: invalid continuation byte"
18:46 🔗 Somebody2 Here's the rest of the traceback: https://hastebin.com/wiqafihadi.sql
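[Editor's sketch] One way an error of this shape can arise, shown with made-up bytes: a byte 0xdd at offset 7 that is not followed by a valid UTF-8 continuation byte. This only illustrates the error class; whether the real failure comes from a redirect target, a response body, or something else is not established in the log. `decode_lenient` is a hypothetical helper, not a function in the project, and the Latin-1 fallback is just one possible policy.

```python
# Hypothetical input: seven ASCII bytes, then 0xdd followed by an ASCII letter.
raw = b"http://\xddxample.test/"

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; `exc` is cleared when the block ends
    print(exc)


def decode_lenient(data: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1, which never fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


print(decode_lenient(raw))
```

A fallback like this trades correctness for robustness: the text may come out wrong for sources that use some other legacy encoding, but the batch no longer dies on a single odd byte.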
18:47 🔗 n00b_42 has quit IRC (Quit: http://chat.efnet.org (Ping timeout))
18:47 🔗 Somebody2 I'll see about spinning up a bunch more projects to make use of all the warriors we've got now.
18:49 🔗 Somebody2 turning on xco to see what happens
18:51 🔗 Somebody2 and waa-ai
18:52 🔗 Somebody2 and pub-vitrue-com
18:52 🔗 Somebody2 probably most of these won't actually get us useful data, but WTH...
18:54 🔗 Somebody2 and u4-hu
18:55 🔗 Somebody2 and tighturl-com
18:56 🔗 Somebody2 turned isgd_6 back on, I think the warriors have been updated to handle HTTPS
19:03 🔗 Somebody2 turned on shar-es too
19:03 🔗 Somebody2 there are a BUNCH of nice features I'd like to add to the admin interface of urlteam...
19:04 🔗 Somebody2 like a *notes* field built into the admin interface
19:04 🔗 Somebody2 rather than having to check back on the wiki
19:04 🔗 Somebody2 and (if I'm dreaming) various summary info about which shortcode ranges we've already tried, and when
19:04 🔗 Somebody2 if someone who codes in Python wanted to help...
19:06 🔗 Somebody2 and turned on trap-it
19:07 🔗 Somebody2 So far, waa-ai, u4-hu, and tighturl-com have returned no results.
19:08 🔗 Somebody2 and u4-hu is throwing some errors, so I'm turning it off
19:10 🔗 Somebody2 so is waa-ai, so it's going off, too
19:12 🔗 Somebody2 trap-it is returning 301s that we didn't expect; now fixing that, and it should start generating results
19:13 🔗 Somebody2 added vbly-us
19:23 🔗 Somebody2 turning on mysp-ac
19:25 🔗 Somebody2 trap-it seems to just redirect all requests to scribblelive.com
19:25 🔗 mtntmnky has joined #urlteam
19:25 🔗 Somebody2 turning it off
19:27 🔗 Somebody2 OK, we're up to around 705 scans/sec -- I'm going to step away for a while. If/when everything blows up, feel free to ping me.
19:27 🔗 Somebody2 mysp-ac is returning results
19:27 🔗 psi Seeing some of these `Error communicating with tracker: HTTPConnectionPool(host='tracker.archiveteam.org', port=1337): Read timed out. (read timeout=60).` when uploading
19:28 🔗 Somebody2 yep, that happens
19:28 🔗 Somebody2 we're pushing the tracker hard
19:28 🔗 psi ah :P
19:42 🔗 kpcyrd Somebody2: I didn't read the whole backlog, but are you on linux, are locales configured and $LANG set correctly?
19:42 🔗 kpcyrd that's not the case in docker for example and python doesn't really like that for some reason
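[Editor's sketch] A quick way to check what kpcyrd is asking about from inside the Python process itself, using only standard-library calls. The helper name `encoding_report` is made up for this example; it reports settings and does not fix anything.

```python
import locale
import os
import sys


def encoding_report() -> dict:
    """Collect the locale and encoding settings Python is actually using."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `LANG` is unset and the preferred encoding comes back as something like ASCII, the locale is a plausible suspect; if everything already reports UTF-8, the problem more likely lies in the bytes being decoded.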
19:44 🔗 Somebody2 kpcyrd: that is certainly possible!
19:44 🔗 Somebody2 This is not on my local box, but on the URLTeam client code running on the warriors.
19:44 🔗 Somebody2 The repo that you'd be looking at is: https://github.com/ArchiveTeam/terroroftinytown
19:45 🔗 Somebody2 let me know if/how else I can help you get started investigating it more (should you wish to)
19:46 🔗 kpcyrd I'm having trouble loading the traceback for some reason
19:50 🔗 kpcyrd Somebody2: does your hastebin link work for you? It seems there's an issue with their server right now
19:52 🔗 pnJay has joined #urlteam
19:53 🔗 JAA Doesn't work for me either.
19:56 🔗 pnJay how many concurrent per ip for this project?
19:58 🔗 Somebody2 hm, let me put it somewhere else
19:58 🔗 Somebody2 pnJay: doesn't matter
19:58 🔗 Somebody2 as many or as few as you wish
19:58 🔗 Somebody2 more than 12 is useless
19:58 🔗 pnJay 12 was the answer I was looking for, I guess. Thanks!
19:59 🔗 Somebody2 I can raise it higher by turning on more projects
20:00 🔗 Somebody2 does this work for the traceback? https://0bin.net/paste/Jaxv8+B3BSnKad0w#-+R683qB8HzivzsboLxiIAc1WyxTw+6nsH7ZaIsYr3j
20:05 🔗 pnJay Will I break anything or piss anyone off if I point a buncha warriors at this?
20:08 🔗 kpcyrd Somebody2: works now, does this happen for every url or just a few?
20:11 🔗 kpcyrd Somebody2: and do you have the url that was requested in the logs?
20:13 🔗 Somebody2 intermittently
20:14 🔗 Somebody2 here's the range of URLs -- it happened for one of them, I can't tell exactly which
20:17 🔗 Somebody2 one thing I could do is make the ranges smaller, temporarily, to make it easier to figure out which ones are borking
20:18 🔗 Somebody2 oh, and I know it's intermittent, because if we retry the same set of URLs enough times, I don't get the error
20:18 🔗 Somebody2 OK, here's a range that it happened in: 38gr1cn TO 38gr1db
20:19 🔗 Somebody2 that's 50 shortcodes
20:19 🔗 Somebody2 cn, co, cp, cq ... d8, d9, da, db
20:20 🔗 Somebody2 and this is for tinyurl.com
20:20 🔗 Somebody2 whoops, sorry, 25, not 50
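[Editor's sketch] The range above can be enumerated to confirm the count. This assumes the shortcodes are fixed-width numbers over the alphabet 0-9 then a-z, an assumption that matches the "d8, d9, da, db" ordering and the corrected total of 25 but is not stated in the log. The helper names are hypothetical.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed digit order
BASE = len(ALPHABET)


def decode(code: str) -> int:
    """Interpret a shortcode as a base-36 number."""
    value = 0
    for char in code:
        value = value * BASE + ALPHABET.index(char)
    return value


def encode(value: int, width: int) -> str:
    """Render a number as a shortcode with exactly `width` characters."""
    chars = []
    for _ in range(width):
        value, remainder = divmod(value, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def code_range(first: str, last: str) -> list:
    """List every shortcode from `first` to `last`, inclusive."""
    width = len(first)
    return [encode(n, width) for n in range(decode(first), decode(last) + 1)]


codes = code_range("38gr1cn", "38gr1db")
print(len(codes), codes[0], codes[-1])
```

Splitting such a list into smaller slices and retrying each slice separately is one way to narrow down which shortcode triggers the intermittent error, along the lines Somebody2 suggests with smaller ranges.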
20:21 🔗 Somebody2 pnJay: nope, you are welcome to point a bunch of Warriors at this -- they just may not get very well used!
20:28 🔗 caff_ has joined #urlteam
20:32 🔗 caff has quit IRC (Read error: Operation timed out)
20:42 🔗 psi i have 12 concs going at this now
20:42 🔗 psi on 2 hosts
20:49 🔗 hook54321 has quit IRC (Quit: Connection closed for inactivity)
22:18 🔗 erine ave_: Wait, really?
22:19 🔗 erine How long does it take to get the notices
22:57 🔗 Somebody2 erine: what?
22:57 🔗 erine Hetzner warning notices?
22:58 🔗 Somebody2 Ah, I don't know anything about those.
23:28 🔗 hook54321 has joined #urlteam
