[00:02] *** enowaldo has quit IRC (Ping timeout: 252 seconds) [00:24] *** nickname has quit IRC (Ping timeout: 260 seconds) [00:31] *** ndiddy has joined #archiveteam-bs [00:39] *** Zerote has quit IRC (Ping timeout: 260 seconds) [00:48] *** tech234a has joined #archiveteam-bs [01:18] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…) [01:26] *** Despatche has joined #archiveteam-bs [01:56] *** enowaldo has joined #archiveteam-bs [02:04] *** enowaldo has quit IRC (Ping timeout: 265 seconds) [03:16] *** odemgi_ has joined #archiveteam-bs [03:19] *** odemgi has quit IRC (Ping timeout: 252 seconds) [03:25] *** odemg has quit IRC (Ping timeout: 615 seconds) [03:31] *** qw3rty118 has joined #archiveteam-bs [03:31] *** odemg has joined #archiveteam-bs [03:35] *** qw3rty117 has quit IRC (Read error: Operation timed out) [04:04] *** simon816 has quit IRC (Read error: Operation timed out) [04:04] *** simon816 has joined #archiveteam-bs [04:22] *** atbk has quit IRC (Quit: ZNC - https://znc.in) [04:22] *** atbk has joined #archiveteam-bs [04:50] *** fuzy802 has joined #archiveteam-bs [04:52] *** fuzzy8021 has quit IRC (Read error: Operation timed out) [05:00] *** fuzy802 is now known as fuzzy8021 [05:01] *** ndiddy has quit IRC () [05:13] I am now downloading the list, but transfer.sh is slow [05:22] *** fuzy802 has joined #archiveteam-bs [05:22] *** fuzzy8021 has quit IRC (Read error: Operation timed out) [05:32] *** fuzy802 is now known as fuzzy8021 [05:37] *** tech234a has quit IRC (Quit: Connection closed for inactivity) [05:38] *** fredgido_ has quit IRC (Read error: Operation timed out) [05:38] *** Exairnous has quit IRC (Read error: Operation timed out) [05:40] *** Exairnous has joined #archiveteam-bs [06:09] *** wp494 has quit IRC (Ping timeout: 492 seconds) [06:10] *** wp494 has joined #archiveteam-bs [06:16] *** Exairnous has quit IRC (Read error: Operation timed out) [06:21] *** Exairnous has joined #archiveteam-bs [06:21] *** polar has 
joined #archiveteam-bs [06:41] *** polar has quit IRC (Quit: Page closed) [07:13] *** BlueMaxim has joined #archiveteam-bs [07:17] *** BlueMax has quit IRC (Read error: Connection reset by peer) [07:35] *** Exairnous has quit IRC (Read error: Operation timed out) [07:57] *** Coderjo has quit IRC (Ping timeout: 265 seconds) [07:58] *** dxrt- is now known as dxrt [07:59] *** dxrt_ sets mode: +o dxrt [08:04] *** Coderjo has joined #archiveteam-bs [08:21] *** svchfoo1 sets mode: +o PurpleSym [08:23] *** BlueMaxim has quit IRC (Read error: Connection reset by peer) [09:10] *** Terbium has quit IRC (Ping timeout: 265 seconds) [09:30] *** Zerote has joined #archiveteam-bs [10:19] *** atomicthu has quit IRC (Read error: Connection reset by peer) [10:19] *** atomicthu has joined #archiveteam-bs [11:01] *** enowaldo has joined #archiveteam-bs [11:05] *** enowaldo has quit IRC (Ping timeout: 252 seconds) [11:15] *** Reventlov has joined #archiveteam-bs [11:15] Hi. [11:20] *** bitBaron has joined #archiveteam-bs [11:26] hi Reventlov [11:26] *** enowaldo has joined #archiveteam-bs [11:34] *** fuzy802 has joined #archiveteam-bs [11:35] *** fuzzy8021 has quit IRC (Read error: Operation timed out) [11:44] *** fuzy802 is now known as fuzzy8021 [11:50] *** enowaldo has quit IRC (Ping timeout: 252 seconds) [11:52] About ArchiveBot: how does it know at which depth it should stop when following links? [11:53] (I ask because for example, in bn7khrtvzbr8nay7pip3k51w4, many domains that are not in the original list are crawled) [11:53] (such as youtube videos, and so on) [11:53] it stops at d = 1 for urls that are not on the original domains? [11:55] by the way this might be useful to someone: https://clbin.com/uRZx4 [11:56] Reventlov: by default it crawls offsite links to a depth of 1, including page requisites on said pages [11:57] ack [11:58] so, let's say we want to crawl *.blog.lemonde.fr: is it possible to feed into ArchiveBot the links that match this regex? 
[11:58] (self-feed) [11:59] it can't span domains other than the ones it started with [12:00] ok. Let's say we missed domains in the starting list: is there an easy way to extract everything in the crawled pages? Like archivebot grepping? [12:00] (context: french blogs shutting down) [12:03] you want to grep WARCs for URLs? [12:04] well I don't have access to the WARCs files, but, yeah, something like that [12:04] well they get uploaded to IA reasonably quickly [12:04] (for example, in the case of these blogs, there is an easy way to find all the urls: it's *.blog.lemonde.fr) [12:05] zcat WARC | grep -P '[-_a-zA-Z0-9]+\.blog\.lemonde\.fr' or similar [12:16] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…) [12:24] *** deevious has quit IRC (Quit: deevious) [12:33] *** enowaldo has joined #archiveteam-bs [12:42] *** enowaldo has quit IRC (Ping timeout: 268 seconds) [12:50] *** deevious has joined #archiveteam-bs [13:08] paul2520: Yes, https://archiveteam.org/dumps/ Weekly dumps there, and they also get uploaded to IA with some delay. [13:09] Reventlov: "Follow any example.org subdomain and recurse on it" isn't possible with ArchiveBot, but you can do it with wpull using the --domain or --accept-regex options. [13:10] (IIRC, --domain is not filtering for domains correctly, and --domain example.org would also recurse on badexample.org. That's very unlikely to be an issue in reality though.) [13:14] kiska: Did you manage to get my list? [13:15] *** enowaldo has joined #archiveteam-bs [13:15] Yes I got the list, and doing *something* to it [13:16] :-) [13:16] I have an hour now, so I'll still try to get the wpull setup working. 
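Editor's note: the grep suggested at 12:05 can be sketched in Python roughly as follows, assuming the WARC content has already been decompressed to text. The function names and the sample input are illustrative and are not part of ArchiveBot or wpull. The two matching helpers only illustrate the kind of suffix pitfall mentioned at 13:10 (`badexample.org` passing a check for `example.org`); they make no claim about how wpull implements its filter.

```python
import re

# Same pattern as the grep at 12:05: one label followed by .blog.lemonde.fr
BLOG_HOST = re.compile(r"[-_a-zA-Z0-9]+\.blog\.lemonde\.fr")


def find_blog_hosts(text):
    """Return the sorted, de-duplicated, lower-cased blog hostnames in text."""
    return sorted({m.group(0).lower() for m in BLOG_HOST.finditer(text)})


def naive_domain_match(host, domain):
    """Plain suffix check: also accepts badexample.org for example.org."""
    return host.endswith(domain)


def strict_domain_match(host, domain):
    """Accept only the domain itself or a true subdomain of it."""
    return host == domain or host.endswith("." + domain)


# Made-up sample standing in for decompressed WARC content.
sample = (
    'href="http://cuisine.blog.lemonde.fr/2019/01/" '
    'src="https://www.youtube.com/watch?v=x" '
    'href="http://Sciences.blog.lemonde.fr/"'
)
hosts = find_blog_hosts(sample)
```

The resulting host list could then serve as an extended seed list for a later job.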
[13:18] *** fuzy802 has joined #archiveteam-bs [13:19] *** fuzzy8021 has quit IRC (Read error: Operation timed out) [13:19] *** PurpleSym has quit IRC (Ping timeout: 268 seconds) [13:19] JAA: I've less the list, and I've seen posts and profile endpoints, so that maybe easy to split off [13:19] *** PurpleSym has joined #archiveteam-bs [13:19] And I just found the topics endpint [13:19] *** svchfoo1 sets mode: +o PurpleSym [13:19] s/endpint/endpoint [13:20] *** MrRadar2 has quit IRC (Ping timeout: 268 seconds) [13:20] *** VoynichCr has quit IRC (Ping timeout: 268 seconds) [13:20] *** enowaldo has quit IRC (Ping timeout: 252 seconds) [13:20] kiska: Yeah, that list was intended as a seed for a recursive grab, so it contains everything I could find essentially. [13:20] Users/channels (same but not the same?), posts, topics, and possibly some other stuff. [13:20] *** Dallas has quit IRC (Ping timeout: 268 seconds) [13:20] I guess we could do something similar to what we did during the tumblr grab and have the warriors also rsync a txt file [13:21] *** Reventlov has quit IRC (Ping timeout: 268 seconds) [13:21] Yeah, collect usernames, upload them to the rsync target, add them as items, reiterate until no new users are found. [13:22] s,users,users/channels, [13:22] Or even collect post IDs and also rsync that [13:22] (On the site, users and channels behave the same and use the same URL format, but they're separate on the API.) [13:22] Hmm, not sure, I think one item should be one (or more) user and all its posts. [13:23] Then you don't really need to collect post links. [13:23] I just collected them because the recursion won't grab them since the later pages are loaded through JS. 
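Editor's note: the "collect usernames, add them as items, reiterate until no new users are found" idea from 13:21 amounts to a fixed-point discovery loop. A minimal sketch, assuming a hypothetical callback `find_linked_users` that returns the user names referenced by one user's pages; the rsync upload and tracker steps from the chat are left out, and the toy data is made up.

```python
from collections import deque


def discover_all(seeds, find_linked_users):
    """Process users until the queue of unprocessed users is empty.

    find_linked_users is a hypothetical callback: given one user name,
    it returns the user names referenced by that user's pages.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        user = queue.popleft()
        for other in find_linked_users(user):
            if other not in seen:
                # Newly discovered user: remember it and schedule it.
                seen.add(other)
                queue.append(other)
    return seen


# Toy link graph standing in for real crawl results (made-up names).
toy_links = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": []}
found = discover_all(["alice"], lambda u: toy_links.get(u, []))
```

Each user is processed once, so the loop ends as soon as a round of discovery turns up nobody new.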
[13:23] From what I can see, the posts/PostID endpoint will redirect to the profile/postID endpoint [13:23] *** Tenebrae has quit IRC (Ping timeout: 268 seconds) [13:24] So I may just exclude the posts/ endpoint in the list and see if I can grab just profiles I guess [13:24] *** overflowe has quit IRC (Remote host closed the connection) [13:25] *** BnAboyZ has quit IRC (Read error: Connection reset by peer) [13:25] *** Reventlov has joined #archiveteam-bs [13:26] Looks like everything in a topic will just direct to the post by a profile, so we may not grab topics, if thats the case [13:26] *** Xibalba has quit IRC (Ping timeout: 268 seconds) [13:26] *** Tenebrae has joined #archiveteam-bs [13:27] Yes, I believe that's the case. I used topics to discover more users/channels, basically. [13:27] *** asie has quit IRC (Ping timeout: 268 seconds) [13:27] Most users came from the "random users" page though, which I hammered for an hour or two. [13:27] Or we can grab topics anyway since its part of a collection of posts [13:28] *** fuzy802 is now known as fuzzy8021 [13:28] Ah xD maybe we can do a disco project where we just hammer that random users page [13:29] Probably won't find too many more users I think. [13:29] I ran it until it wasn't finding much anymore. [13:30] Most important is to get it up and running quickly. [13:31] We'll do more disco, if we run out of profiles [13:31] Argh, not this shit again. [13:31] Hrm? [13:31] *** overflowe has joined #archiveteam-bs [13:31] Trying to compile Python 3.4 on your machines and running into OpenSSL version incompatibilities. [13:31] *** sHATNER has quit IRC (Ping timeout: 365 seconds) [13:32] How long does Hetzner typically take for first order? 
[13:32] *** Tenebrae has quit IRC (Ping timeout: 268 seconds) [13:32] *** Xibalba has joined #archiveteam-bs [13:33] *** sHATNER has joined #archiveteam-bs [13:33] I ordered it nearly 20 hours ago and still nada :-/ [13:33] Got order confirmation [13:33] *** asie has joined #archiveteam-bs [13:33] *** Tenebrae has joined #archiveteam-bs [13:33] *** Dallas has joined #archiveteam-bs [13:34] *** BnAboyZ has joined #archiveteam-bs [13:34] *** MrRadar2 has joined #archiveteam-bs [13:34] *** svchfoo1 sets mode: +o MrRadar2 [13:34] *** svchfoo3 sets mode: +o MrRadar2 [13:36] Who else does machine management with for host in ...; do ssh root@${host} 'apt install something'; done ? :-) [13:36] :p [13:36] *** overflowe has quit IRC (Remote host closed the connection) [13:36] *** overflowe has joined #archiveteam-bs [13:37] I should do something more clever with tmux panes instead. [13:37] *** VoynichCr has joined #archiveteam-bs [13:40] I tend to use ansible's ad-hoc commands. [13:40] (because I use ansible for configuration anyway.) [13:44] *** atomicthu has quit IRC (Read error: Connection reset by peer) [13:44] *** atomicthu has joined #archiveteam-bs [13:44] Oooh, this is so much better. [13:44] astrid: Thanks again for telling me about tmux synchronize-panes sometime last year! It's absolutely awesome. [13:45] JAA: Do you require me to downgrade the version of Debian on them? [13:46] kiska: No, it's fine. Just had to install the OpenSSL 1.0 headers instead of 1.1. [13:46] Ah [13:46] So libssl1.0-dev instead of libssl-dev. [13:47] Anyway I'll start to look at starting a warrior if it's needed, or if you can't get your wpull thing active [13:47] I hope you don't intend to use these for anything else though since they're a mess already. [13:47] Nope, they are destroyed afterwards anyway [13:47] :-) [13:47] The header thing is not a real issue here, but in another case I actually needed the 1.1 headers for something else. 
[13:48] It's really annoying that you can't have both installed at the same time. [13:49] *** enowaldo has joined #archiveteam-bs [13:52] *** apache2 has quit IRC (Remote host closed the connection) [13:53] *** apache2 has joined #archiveteam-bs [14:07] *** atomicthu has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) [14:13] Dear lord.... What am I looking at... [14:14] *** godane has quit IRC (Read error: Connection reset by peer) [14:18] ? [14:19] sola.ai and the onclick event [14:23] Ah [14:25] Yeah, if you stare into the abyss, the abyss stares back at you... [14:26] *** Zerote has quit IRC (Ping timeout: 260 seconds) [14:27] *** atomicthu has joined #archiveteam-bs [14:29] *** godane has joined #archiveteam-bs [14:45] *** Terbium has joined #archiveteam-bs [15:02] *** bitBaron has joined #archiveteam-bs [15:04] *** enowaldo has quit IRC (Read error: Operation timed out) [15:05] *** godane has quit IRC (Ping timeout: 265 seconds) [15:25] *** godane has joined #archiveteam-bs [15:40] *** drcd has joined #archiveteam-bs [16:03] *** coderobe has quit IRC (Quit: '); DROP TABLE users;--) [16:06] *** coderobe has joined #archiveteam-bs [16:06] kiska: Just to avoid any misunderstandings, please do work towards getting it running. I'm still working on the wpull setup when I can, but there's a decent chance I won't be able to get it up in time. [16:09] I got my wpull version running in principle, but there's some issue with the DB right now, and I won't have much more time tonight to work on this. [16:09] *** schbirid has joined #archiveteam-bs [16:10] Also, who the fuck thought it's a good idea to have a footer on an infinite-scroll website?? [16:10] My response as well [16:10] The site will shut down in a bit under 20 hours, by the way. 
(12:00 GMT on the 10th) [16:11] Yeah I am having issues with the "Load more" button on profiles, wget isn't "clicking" them [16:12] *** enowaldo has joined #archiveteam-bs [16:12] Their js script file has 177k lines which I am having trouble parsing, seeing if I can emulate the requests [16:12] JS... [16:13] Yeah, I just observed the requests in the browser and then scripted around that. [16:13] For scraping topics etc. [16:14] For user profiles, I did /users/${user}/posts/?limit=30&offset=XXX&limit=30 [16:14] Yes, two times limit because reasons. [16:14] Hrm... Might be able to use that to emulate the req [16:15] Apparently 7c236626-a12c-48d1-990f-07e4b3c2d884 = freistaat... [16:15] Yeah, each user has a UUID. [16:16] That's what I looked at first to see whether there's a way to iterate over all users through integers. Nope. [16:17] I am trying to figure out how they generate that UUID [16:19] kiska: Here are the commands I used to generate that list, by the way; maybe it's useful. http://ix.io/1FL0+/sola.ai-cmds [16:19] GNU parallel + curl is pretty nice. [16:19] I'll take a look, but I am looking at their getUserPosts() function [16:20] Oh wait, I used xargs with -P, not parallel, nvm. [16:21] Firefox is chugging with prettifying their js [16:21] Can transfer.sh please get their shit together? I have some 269k Sola API URLs for ArchiveBot... [16:22] (Can't put it on ix.io since the file's way above the size limit.) [16:25] pastebin? [16:25] Random filenames in the AB archives? No thanks. [16:26] We have too many "urls-pastebin.com-rAnDoM-inf-20180101-000000-asdfg-00000.warc.gz" files in there already. [16:27] WTF is this?! https://pastebin.com/MbfJGzni [16:27] The UUID is here [16:28] It looks like JSON data to me [16:28] Yup, it is JSON. [16:28] Where's that from? 
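Editor's note: the paginated request pattern quoted at 16:14 can be generated with a few lines of Python. This is only a sketch: the path shape, including the repeated `limit` parameter, is copied from the chat, while `total_posts`, the function name, and the placeholder user are assumptions. The log does not state the API host, whether `${user}` is a name or a UUID, or how the end of a listing is detected.

```python
def post_page_paths(user, total_posts, page_size=30):
    """Build one request path per page of a user's posts.

    total_posts is an assumed input; the chat does not say how the
    real scripts decided when to stop paging.
    """
    paths = []
    for offset in range(0, total_posts, page_size):
        # limit appears twice on purpose, mirroring the quoted pattern.
        paths.append(
            f"/users/{user}/posts/?limit={page_size}"
            f"&offset={offset}&limit={page_size}"
        )
    return paths


# "someuser" is a placeholder, not a real account.
pages = post_page_paths("someuser", 75)
```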
[16:28] The html of the profile [16:29] https://sola.ai/freistaat view source this [16:29] Ah [16:29] *** enowaldo has quit IRC (Read error: Operation timed out) [16:29] Looks like I'll be working out how to use arkiver's JSON library he used in G+ grab [16:29] ? [16:30] Everything has UUID at Sola. Posts, post parts, images (I think), users, channels... [16:30] sola.ai going down? [16:30] Yes, tomorrow at noon UTC. [16:31] right deadline [16:32] tight* [16:32] do we have a list of users? [16:32] We have a seed located here: https://transfer.sh/bfKxz/sola.ai-out [16:32] Yeah, incomplete scrape. [16:32] kiska: or are you handling this one? [16:32] See also the sola.ai-cmds file I linked above. [16:33] I scraped a variety of resources ("random users" page, topics, recent posts, etc.) [16:33] I think the words are "attempting to" [16:33] But nothing's archived yet. [16:33] JAA: can you post it for me again please [16:33] 04-09 16:19:00 <@JAA> kiska: Here are the commands I used to generate that list, by the way; maybe it's useful. http://ix.io/1FL0+/sola.ai-cmds [16:34] and your results? [16:34] sorry, don´t see it here [16:34] What kiska just linked. [16:34] Oh wait, transfer.sh's having issues, right. [16:35] *sigh* [16:36] I know that feeling. [16:36] I am uploading a item to IA with that file [16:36] My API URL job in ArchiveBot just started retrieving stuff. [16:37] So at least something is being saved. [16:37] Ah excellent! [16:38] Need to go now, so I won't get my wpull grab up probably. Anything you can do will be greatly appreciated. [16:38] Or should I split the list up and at least !ao < it? [16:39] (I know you suggested that before and I said it'd be incomplete since it's only a seed list, but something's better than nothing, and with the deadline looming...) [16:39] Can we split? 
Put a web server on one of the machines I gave you [16:39] And feed the urls that way into AB [16:40] Hmm, it seems like there are some free resources on AB at the moment anyway since my API job ended up on a normal pipeline. [16:40] Don't have time to set up AB right now. [16:41] You don't need to setup AB, just set one of them up as a web server, split files and upload to that machine. feed resulting url to AB. I'll try and set up AB on those machines once I give up [16:41] s/files/list [16:41] Ah [16:42] Yeah, that's basically what I just did with my irssi machine for the API grab. python3 -m http.server FTW. [16:42] Ah xD [16:45] *** Jonimus has quit IRC (Quit: WeeChat 1.9.1) [16:45] 2.8 million users? [16:45] *** Jonimus has joined #archiveteam-bs [16:45] No, that file has users, posts, and a bunch of other stuff. [16:46] *** svchfoo1 sets mode: +o Jonimus [16:46] *** svchfoo3 sets mode: +o Jonimus [16:46] Don't remember how many users I found, I think ~60k or so? [16:46] ah I see [16:46] Here is IA item: https://archive.org/details/sola.ai-out [16:46] right, probably no warrior project needed? [16:46] If you can't see the list [16:46] thanks [16:46] yeah, got it from JAA [16:46] Ah xD [16:46] 61513 users [16:47] I don´t think we really need a warrior project for this? [16:50] I don't think we do either, but as JAA said his scrape is certainly incomplete [16:51] arkiver: The idea was to start a recursive grab from that list, but that's too large for ArchiveBot or a single wpull instance. So I wanted to do a distributed grab with my special wpull version for that, but I didn't have enough time to get that running as I'm really busy this week. [16:58] *** fredgido has joined #archiveteam-bs [16:59] *** kode54 has quit IRC (Quit: Ping timeout (120 seconds)) [16:59] *** apache2 has quit IRC (Remote host closed the connection) [16:59] *** atbk has quit IRC (Quit: ZNC - https://znc.in) [16:59] *** rellem has quit IRC (Quit: Blimey! The Roo's in my sock drawer!) 
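Editor's note: the "split the list and serve it from a web server" plan discussed from 16:39 to 16:42 could look roughly like this. The file naming scheme, the chunk size, and the demo URLs are assumptions made for illustration; nothing here reflects the actual files used that day.

```python
from pathlib import Path
import tempfile


def split_url_list(urls, out_dir, chunk_size):
    """Write urls into numbered files of at most chunk_size lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(0, len(urls), chunk_size):
        # Hypothetical naming scheme: urls-00000.txt, urls-00001.txt, ...
        path = out_dir / f"urls-{i // chunk_size:05d}.txt"
        path.write_text("\n".join(urls[i:i + chunk_size]) + "\n")
        written.append(path)
    return written


# Demo with made-up URLs in a temporary directory.
demo_urls = [f"https://example.invalid/item/{n}" for n in range(7)]
demo_dir = tempfile.mkdtemp()
chunks = split_url_list(demo_urls, demo_dir, chunk_size=3)
```

Running `python3 -m http.server` in the output directory, as mentioned at 16:42, would then expose each chunk at its own URL for an `!ao <` job.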
[16:59] *** eientei95 has quit IRC (Quit: ZNC 1.7.0+deb0+bionic1 - https://znc.in) [16:59] *** DFJustin has quit IRC (Remote host closed the connection) [16:59] *** atbk has joined #archiveteam-bs [16:59] *** apache2 has joined #archiveteam-bs [16:59] *** kode54 has joined #archiveteam-bs [16:59] *** DFJustin has joined #archiveteam-bs [16:59] *** rellem has joined #archiveteam-bs [17:03] *** eientei95 has joined #archiveteam-bs [17:04] *** svchfoo1 sets mode: +o eientei95 [17:04] *** svchfoo3 sets mode: +o eientei95 [17:08] *** enowaldo has joined #archiveteam-bs [17:15] *** Zerote has joined #archiveteam-bs [17:19] *** enowaldo has quit IRC (Read error: Operation timed out) [17:20] *** fredgido has quit IRC (Read error: Operation timed out) [17:32] *** coderobe has quit IRC (Read error: Connection reset by peer) [17:44] *** enowaldo has joined #archiveteam-bs [17:53] *** coderobe has joined #archiveteam-bs [17:54] *** Pixi` has quit IRC (Quit: Pixi`) [17:55] *** Pixi has joined #archiveteam-bs [18:06] *** tech234a has joined #archiveteam-bs [18:12] *** Exairnous has joined #archiveteam-bs [18:15] *** Fusl has quit IRC (K-Lined) [18:16] *** Fusl has joined #archiveteam-bs [18:17] *** Fusl_ sets mode: +o Fusl [18:33] *** ayanami_ has joined #archiveteam-bs [18:34] *** Hani111 has joined #archiveteam-bs [18:34] So this one forum I'm on, all of a sudden, without any notice, is removing anything and everything off topic [18:35] to the main purpose of the forum [18:35] Kind of want to do a quick archivebot job but it might be too big [18:36] https://forum.gethopscotch.com/t/changes-to-the-forum/51127 Here's a link [18:37] I don't really go on this forum often though so I didnt know until I logged in just now [18:37] *** Hani has quit IRC (Ping timeout: 246 seconds) [18:38] *** Hani111 is now known as Hani [18:39] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 
😴😪ZZZzzz…) [18:48] *** ayanami_ has quit IRC (Quit: Leaving) [18:49] *** enowaldo has quit IRC (Read error: Operation timed out) [18:52] *** Hani111 has joined #archiveteam-bs [18:55] *** Hani has quit IRC (Read error: Operation timed out) [18:55] *** Hani111 is now known as Hani [19:05] *** killsushi has joined #archiveteam-bs [19:17] *** bitBaron has joined #archiveteam-bs [19:39] *** BartoCH has quit IRC (Ping timeout: 615 seconds) [19:44] *** enowaldo has joined #archiveteam-bs [19:49] *** enowaldo has quit IRC (Ping timeout: 265 seconds) [19:56] *** achip has quit IRC (Ping timeout: 255 seconds) [20:04] *** achip has joined #archiveteam-bs [20:08] *** enowaldo has joined #archiveteam-bs [20:16] *** tech234a has quit IRC (Quit: Connection closed for inactivity) [20:42] *** bobmcjr has quit IRC (Read error: Operation timed out) [21:08] *** enowaldo has quit IRC (Read error: Operation timed out) [21:14] *** RichardG has quit IRC (Read error: Connection reset by peer) [21:15] *** RichardG has joined #archiveteam-bs [21:19] *** RichardG_ has joined #archiveteam-bs [21:19] *** RichardG has quit IRC (Read error: Connection reset by peer) [21:19] *** RichardG_ is now known as RichardG [21:21] *** tuluu has joined #archiveteam-bs [21:22] *** BartoCH has joined #archiveteam-bs [21:31] why is the tracker down? [21:36] seems up from here [21:36] what do you mean by down [21:36] [astrid@xn--zoty-01a:14:36 0 ~]$ uptime [21:36] 14:36:57 up 136 days, 5:45, 2 users, load average: 1.75, 1.97, 2.06 [21:37] down as in http://xor.meo.ws/2efd5efc/091a/4558/acea/66234909aafa.png [21:37] We're sorry, but something went wrong. [21:37] The issue has been logged for investigation. Please try again later. [21:37] hm yeah that looks hosed [21:38] above my pay grade, sorry [21:39] above my access level, sorry [21:39] and i'm off now anyways so [21:39] Kaz chfoo: any of you wanna take a look? 
[21:39] *** Ravenloft has joined #archiveteam-bs [21:39] Will look soon [21:44] redis needs a restart [21:44] Fixed? [21:45] I did it [22:00] *** enowaldo has joined #archiveteam-bs [22:13] *** enowaldo has quit IRC (Ping timeout: 252 seconds) [22:32] *** BlueMax has joined #archiveteam-bs [23:09] *** coderobe2 has joined #archiveteam-bs [23:11] *** coderobe2 has quit IRC (Read error: Connection reset by peer) [23:18] *** BlueMax has quit IRC (Read error: Connection reset by peer) [23:23] Ah crap, my ArchiveBot jobs for sola.ai are not retrieving the post contents because the URLs I threw in are redirects. [23:26] *** fredgido has joined #archiveteam-bs [23:34] kiska, arkiver: ^ [23:51] *** enowaldo has joined #archiveteam-bs [23:56] Or more precisely, because they're redirects to a different path (e.g. /posts/YWRmZGV/ -> /informative/i-got-5-want-more-peoples-come-on-at-comments-we-will-creat-YWRmZGV). [23:58] *** KoalaBear has quit IRC (Read error: Connection reset by peer)