[00:45] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[00:52] *** VerifiedJ has quit IRC (Leaving)
[01:17] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:19] *** Sk1d has joined #archiveteam-bs
[01:27] *** dashcloud has joined #archiveteam-bs
[01:31] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:34] *** Sk1d has joined #archiveteam-bs
[02:29] *** Stiletto has joined #archiveteam-bs
[02:31] *** Stilett0 has quit IRC (Ping timeout: 265 seconds)
[02:35] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:39] *** Sk1d has joined #archiveteam-bs
[02:51] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:56] *** Sk1d has joined #archiveteam-bs
[04:25] *** Stilett0 has joined #archiveteam-bs
[04:30] *** Stiletto has quit IRC (Read error: Operation timed out)
[04:30] *** Stiletto has joined #archiveteam-bs
[04:33] *** Stilett0 has quit IRC (Read error: Operation timed out)
[04:47] *** qw3rty111 has joined #archiveteam-bs
[04:50] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[05:09] *** BlueMaxim has joined #archiveteam-bs
[05:12] *** BlueMax has quit IRC (Ping timeout: 260 seconds)
[05:42] *** jodizzle has joined #archiveteam-bs
[06:42] *** fenn has quit IRC (Remote host closed the connection)
[06:51] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[06:51] *** Mateon1 has joined #archiveteam-bs
[08:05] <Lord_Nigh> frhed is safe, sort of: https://web.archive.org/web/20181105080330/http://www.snipersparadise.net/download/unrealeditor/frhed-v1.1.zip
[08:06] <Lord_Nigh> by 'sort of' i mean that zip file is slightly modified, someone added a .txt file that the original zip did not have
[08:06] <Lord_Nigh> the .exe inside is exactly the same crc32 though
[08:06] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:09] <Lord_Nigh> https://web.archive.org/web/20181105080857/http://unrealtexture.com/Unreal/Downloads/Tutorials/frhed-v1.1.zip is an untainted original archive
[08:10] *** Sk1d has joined #archiveteam-bs
[08:12] <Lord_Nigh> https://web.archive.org/web/20181105081155/https://sobac.com/ZIPfiles/File%20Utilities/frhed-v1.1.zip is another untainted one
[08:12] <Lord_Nigh> and NONE of those sites were properly archived, so i fed em to the bot
[08:13] <Flashfire> Lord_Night what collection are they from?
[08:36] <Flashfire> I am going to make a controversial page
[08:45] <Lord_Nigh> Flashfire: random sites; the first two are old unreal/unreal tournament modding files archives
[08:45] <Lord_Nigh> the latter is weird
[08:46] <Flashfire> Alright also Lord_Nigh what do you think I decided to start another wiki page
[08:46] <Lord_Nigh> ? why should that bother me?
[08:46] <Lord_Nigh> I barely have anything to do with the wiki
[08:47] <Flashfire> lol I dont know
[08:47] <Flashfire> Just wanted an opinion
[09:38] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:40] *** Sk1d has joined #archiveteam-bs
[09:54] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:56] *** Sk1d has joined #archiveteam-bs
[10:02] <godane> so i placed a bid on this set of magazines : https://www.ebay.com/itm/PC-Computing-magazine-45-issues-1993-1997-no-reserve/192702649323
[10:03] <godane> SketchCow: looks like there only 11 issues scanned of it : https://archive.org/search.php?query=%22PC%20Computing%20Magazine%22&and[]=mediatype%3A%22texts%22
[10:05] <Flashfire> Good luck godane
[10:05] <godane> the only issue to overlap is 1997-06 issue
[10:06] <godane> i was between that and the PCM magazine set : https://www.ebay.com/itm/PCM-Person-Computing-Magazine-for-Tandy-Computer-Users/323534021660
[10:06] <godane> that one has another 6+ days
[10:07] <godane> the PC Computing i have only 6 and half hours
[10:09] <Flashfire> I will continue adding to the FTP list until it ban no longer be ignored
[10:12] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:15] *** Sk1d has joined #archiveteam-bs
[10:41] *** SmileyG has joined #archiveteam-bs
[10:41] *** Smiley has quit IRC (Read error: Operation timed out)
[10:54] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:18] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:21] *** Sk1d has joined #archiveteam-bs
[11:35] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:39] *** Sk1d has joined #archiveteam-bs
[11:51] <Kaz> how often is the wiki backed up?
[11:52] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:53] <JAA> Kaz: Our wiki, you mean? Weekly dumps at https://archiveteam.org/dumps/ which I believe are collected by SketchCow and uploaded to IA periodically (every few months or so).
[11:54] <Kaz> cool - yeah, just wanted to confirm the backups were actually being put somewhere other than on the host itself
[11:54] <Kaz> seems like there's been a lot of activity on it recently
[11:56] *** Sk1d has joined #archiveteam-bs
[12:18] *** NetwideRo has joined #archiveteam-bs
[12:21] <NetwideRo> I'm trying to archive a few sites from an old ARG, which requires passwords out of books to be sent in a form before you can get some of the content.  I've been using wget to make WARCs of all the easy static content, is there a recommended best way to archive the interactions that send passwords?
[12:23] <JAA> NetwideRo: Depends on how that login works. If it's an HTTP basic auth, then you can just pass a URL of the form http://username:password@example.org/path. If it's a form, then you need to use --post-data and --save-cookies (and likely --keep-session-cookies) to "log in" and save the cookies to a file, then --load-cookies plus --warc-* etc. to do the actual archival.
[12:24] <JAA> Assuming the form uses POST. If it uses GET, then just use the appropriate URL instead of --post-data.
[12:25] <SketchCow> I checked the backup situation, and yeah.
[12:25] <SketchCow> It's not automatically shoved into the item, but it was on FOS and ultimately I'm likely to put it in automatically.
[12:35] <JAA> Found a wonderful way to foster depression: going through sites marked as "closing" and archival "upcoming" on the wiki which are long gone and lost. :-|
[12:36] <NetwideRo> Thanks JAA.  Is it possible to save the password POST followed by a GET using the cookies in the same WARC?  The user is supposed to send one of several passwords to "lib/login.php" and then load "chapter.php" in the same session to get the page corresponding to the password they sent.  I want to save all of the possible interactions.
[12:38] <JAA> NetwideRo: I don't think that's possible with a single wget invocation. While you can specify more than one URL on the command line, that would cause wget to POST all of those URLs (with the same data).
[12:39] <JAA> Of course, you can also use --warc-* with the initial POST request. Since wget doesn't have a --warc-append option (unlike wpull), you'll have to merge the WARCs manually afterwards.
[12:39] <JAA> Regarding "in the same session", that's why I referred to --keep-session-cookies. That allows you to continue the same "session" across multiple invocations of wget.
[12:40] <JAA> (Specifically, "sessions" work through cookies that don't have an explicit expiration date. --keep-session-cookies causes wget to still write those cookies to the cookie file, and so when you --load-cookies them again on the next wget process, it'll have the same cookies and therefore behave as if it's the same session.)
[12:41] <NetwideRo> Merging should work.  Thanks for your help
[12:42] *** NetwideRo has left 
[12:42] <betamax> Well, my attempt to get all the campaign sites for the midterms is going terribly
[12:42] <betamax> been running for 12+ hours and only got ~110 of 6,000
[12:43] <JAA> betamax: Homepages or entire websites?
[12:43] <betamax> entire websites
[12:43] <betamax> it's because many have event calandars
[12:43] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:43] <betamax> with a page for every day
[12:43] <JAA> Ah, right.
[12:43] <JAA> Calendars are the wors.t
[12:43] <betamax> and they go back forever, so I have to notice it's stuck on one and cancel the archiving
[12:44] <betamax> don't suppose there's any easy way round this?
[12:45] *** VerifiedJ has joined #archiveteam-bs
[12:46] *** Sk1d has joined #archiveteam-bs
[12:46] <JAA> betamax: You could limit the depth of archival using --limit (assuming you're using wget or wpull for this). There's a small chance it might miss some content with this though.
[12:48] <JAA> Other than that, there is no generic way to handle this. You can add patterns to filter those calendars out beyond certain dates, but the required pattern will depend on the exact calendar software used.
[12:51] *** alex____ has quit IRC (Ping timeout: 360 seconds)
[12:54] *** alex__ has joined #archiveteam-bs
[12:59] <betamax> JAA: is there a way to abort wget gracefully without damaging the warc its creating?
[13:00] <betamax> (if so, I'll make it stop after a max of 3 minutes, as most of the time it takes <1 minute to archive a site)
[13:00] <JAA> betamax: No idea. I don't use wget for archival.
[13:03] <betamax> ah, well. I'll do it non-gracefully and make a list of all the ones that were aborted, to be done later
[13:06] <JAA> The number of dead paste links on our wiki is eye-watering.
[13:28] *** Stilett0 has joined #archiveteam-bs
[13:28] *** Stiletto has quit IRC (Ping timeout: 265 seconds)
[13:54] <hiroi> The wiki page watch emails have “localhost” set as the host name for the wiki. Can someone with admin access fix it? Thx
[13:55] <Kaz> cc jrwr SketchCow ^
[13:55] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[13:57] *** dashcloud has joined #archiveteam-bs
[14:44] <SketchCow> Need jrwr to do it, he's the smart one.
[14:44] <SketchCow> JAA: I'd like to set up archiving for a couple outlink parts of the archive.
[14:44] <SketchCow> I also wanted to do so for justsolve.archiveteam.org
[14:45] <SketchCow> JAA: And yes, as for missed opportunities, let's cut back on them now
[14:45] <SketchCow> Hence I'm back, baby
[14:47] * anarcat waves
[14:47] <anarcat> so a couple of fellows here have setup a server to mirror ... er... brasil
[14:47] <anarcat> we've got 160GB of videos and ... hmm... about 30GB of sites so far
[14:47] <anarcat> i think a lot of that stuff should go in a collection in IA
[14:48] <anarcat> maybe not the videos, but definitely the site crawls
[14:48] <anarcat> we've been running grab-site around the clock to save sites before the political turnover over there
[14:48] <anarcat> and i've been wondering how to better coordinate with you folks
[14:49] <anarcat> so far i've thrown a few sites at archivebot, but have mostly stopped because i don't want to clog the queue
[14:49] <SketchCow> Well, first of all, if you do it outside of archiveteam structure, it won't go in wayback.
[14:49] <anarcat> ah
[14:49] <SketchCow> But you can certainly upload the WARCs you're making and they'll go into warczone on IA
[14:49] <SketchCow> Which is something
[14:49] <anarcat> right
[14:49] <anarcat> i've uploaded smaller WARC files to IA before and someone (you i think?) have been nice enough to add it to IA
[14:50] <SketchCow> Yeah, that got shutdown, but you can upload them to the general sets and I'll have scripts that see them and fling them in WARCzone
[14:50] <anarcat> git-annex says we got 34.76 GB of site data so far
[14:50] <jrwr> I HAVE BEEN SUMMONED
[14:50] <jrwr> Whats up SketchCow 
[14:51] <anarcat> SketchCow: could we get a collection to regroup that stuff?
[14:51] <SketchCow> 1. Apparenty we have a situation that e-mails from archiveteam wiki defined as "localhost"
[14:51] <SketchCow> < hiroi> The wiki page watch emails have "localhost" set as the host name for the wiki. Can someone with admin access fix it? Thx    
[14:51] <jrwr> Oh, Ya I can fix that SketchCow, I didn't keep the logins for the box, so let me dig into my archives and see if I can find them
[14:51] <SketchCow> That'd be good.
[14:51] <SketchCow> Also, is it JAA or jrwr who is responsible for this storify blob on FOS
[14:52] <SketchCow> Because work needs to start on that soon or you'll have to do it later sometime
[14:52] <jrwr> not me
[14:55] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[14:55] <jrwr> So, I have everything but the cpanel password
[14:55] <jrwr> I think we used hang outs for that SketchCow 
[14:56] *** dashcloud has joined #archiveteam-bs
[15:00] <jrwr> nvm, I dug it up, that was interesting how long it keeps those
[15:00] <SketchCow> Yeah
[15:01] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:04] *** Sk1d has joined #archiveteam-bs
[15:05] <jrwr> @hiroi can I get a headers dump from the email, I've got the senders set in mediawiki
[15:06] <anarcat> SketchCow: so i dump everything in "opensource" and tell you about items?
[15:06] <SketchCow> Yes
[15:06] <anarcat> SketchCow: wouldn't it be easier to group this into collections?
[15:06] <SketchCow> Or I'll find them
[15:06] <SketchCow> And then at some point, we'll make a big-ol collection and you can then keep uploading there
[15:06] <anarcat> ah
[15:06] <anarcat> why not make the collection first?
[15:09] <jrwr> nvm, I was dumb -- should be fixed now SketchCow Kaz hiroi 
[15:12] <SketchCow> OK, great
[15:12] <SketchCow> anarcat: Silly IA rule, we don't make new collections before items arrive unless there's contracts and shiznat
[15:12] <SketchCow> Cuts down on a lot of "TBD" collections
[15:13] <anarcat> SketchCow: what if we create an archiveteam project here?
[15:13] <anarcat> SketchCow: what we're doing is basically https://archiveteam.org/index.php?title=Government_Backup but for .br
[15:13] <anarcat> i've been hesitant in creating a wiki page and channel and all that stuff because i'm unfamiliar with the process
[15:13] <anarcat> but if that's the way to go, i'm all in
[15:13] <anarcat> we've got stuff to throw at warriors too
[15:13] <anarcat> if i can figure out how that works
[15:14] <anarcat> and large CKAN datasets and so on
[15:15] <jrwr> updated the wiki to 1.31.1 
[15:16] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:20] <jrwr> rebuilding caches now
[15:21] *** Sk1d has joined #archiveteam-bs
[15:21] <jrwr> SketchCow: have the backups been showing up in FOS from Archiveteam wiki
[15:21] <SketchCow> Yes
[15:22] <SketchCow> And I finally dumped them into the IA infrastructure
[15:22] <jrwr> nice
[15:22] <jrwr> should help in case of explosions
[15:22] <SketchCow> teamarchive2:/1/CHFOO/warrior/at.org-wiki-dumps# ls -l
[15:22] <SketchCow> total 485716
[15:22] <SketchCow> -rw-r--r-- 1 1001 1001 497361153 Nov  2 08:01 archiveteam-dump-2018-11-02.xml.gz
[15:22] <SketchCow> -rw-r--r-- 1 root root       107 Nov  2 00:06 UPLOAD_TO_INTERNETARCHIVE.sh
[15:23] <SketchCow> I'll be making that .sh run automatically daily shortly
[15:23] <JAA> SketchCow: Storify is me, yeah. Sorry about that.
[15:23] <SketchCow> Move on it, cutie. I have a guy from Poland sending me 15tb of crap
[15:23] <SketchCow> I can stay on top of him because he's "only" doing 3-10gb a day, but who knows
[15:24] <jrwr> cool
[15:25] * anarcat got 180GB at day six :)
[15:25] <anarcat> i'll start uploading to archive.org now
[15:26] <SketchCow> OK, I made it automatic before I forget to make it automatic
[15:26] <SketchCow> More a "hey the backups aren't working false alarms" than "disk space", obviously
[15:27] <jrwr> cool
[15:27] <SketchCow> Currently, the "inbox" folder at FOS (that's the archivebot and similar shiznat) is at 89mb, which is good cheese
[15:28] <SketchCow> The ARCHIVETEAM folder (which is the one that is handed all the stuff from inbox) is at 720k
[15:29] * jrwr has never seen SketchCow use the word shiznat and believes IRC is corrupting him
[15:29] <SketchCow> The ARCHIVEBOT folder is 240gb
[15:29] <SketchCow> Mostly due to an infrastructure backup (S3 is going barf)
[15:30] <SketchCow> https://twitter.com/textfiles/status/597925997288099840
[15:30] <SketchCow> (Me saying Shiznat in Twitter in 2015)
[15:30] <SketchCow> Also, I've been on IRC since 1998.
[15:31] <SketchCow> So literally 20 years
[15:31] <anarcat> kids these days
[15:32] * jrwr has been on since 2000
[15:32] <jrwr> it ruined my grammer
[15:32] <jrwr> since I was (I'm going to date my self) 10 years old on a webtv talking on Talk City
[15:33] * jut was born in 2000
[15:35] <SketchCow> Yes, you've only known a time with IRC
[15:35] <SketchCow> It's sad.
[15:35] <SketchCow> So sad
[15:36] <SketchCow> So, in 1995, babies never died and we all were fit and healthy
[15:36] <SketchCow> I'm sorry you missed it
[15:36] <anarcat> that is kind of sad
[15:37] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:39] <jut> I missed many things, what is sadder that a 10 year old only knows a world with smartphones
[15:39] <jrwr> I missed out on BBSes
[15:39] <jrwr> I would of ran one
[15:39] <jrwr> would of been king of the 777 seas
[15:40] <JAA> would have*
[15:40] *** Sk1d has joined #archiveteam-bs
[15:40] <SketchCow> Shhh, he'd totally of fit in back there
[15:40] <jrwr> I do infosec CTFs and I still do scavenger hunts on Textfiles
[15:41] <jrwr> fun to go pull crap out of old posts or disk images and make people hunt them down using a set of clues
[15:42] <jrwr> I do have a 6502 badge that I really want to port CP/M to
[16:19] <anarcat> SketchCow: a tiny test crawl https://archive.org/download/memorialanistia.org.br
[16:19] <anarcat> https://archive.org/details/cnv.memoriasreveladas.gov.br is a larger one being uploaded
[16:20] <anarcat> i have 31 more crawls to upload, totaling 30+GB
[16:20] <anarcat> SketchCow: do those look okay to you? should i keep going?
[16:38] *** SimpBrain has quit IRC (Ping timeout: 252 seconds)
[17:07] <anarcat> gaah why is IA's s3 failing now (403 Forbidden. "The request signature we calculated does not match the signature you provided.") - yet i create a few buckets already without problems
[17:07] <anarcat> are there quotas or something?
[17:17] *** wp494 has quit IRC (Read error: Operation timed out)
[17:17] *** wp494 has joined #archiveteam-bs
[17:20] <adinbied> Any idea on what the issue with the tracker is? Something with the Redis side?
[17:25] <PurpleSym> betamax: Yeah, same for the chromebot grab. About 1/4th for Facebook, nothing for Twitter yet. I need a bigger machine.
[17:35] <adinbied> Also, I've now got the angelfire scripts done and working correctly (as far as I can tell) - if someone (JAA, arkiver, or whoever) can take a look at everything and create an ArchiveTeam github repo for them to get ready for Warriors, that would be great.
[17:35] <adinbied> https://github.com/adinbied/angelfire-grab
[17:35] <arkiver> sure
[17:36] <adinbied> https://github.com/adinbied/angelfire-items
[17:38] <arkiver> it looks pretty good already, but needs a few changes
[17:40] *** Sk1d has quit IRC (Read error: Operation timed out)
[17:40] <arkiver> adinbied: I´ve made https://github.com/orgs/ArchiveTeam/teams/angelfire
[17:40] <adinbied> OK, feel free to make any adjustments you need to or let me know what needs changing - I'm still learning the basics of Lua
[17:40] <adinbied> Sounds good - thanks
[17:41] <arkiver> and forked the repos into the archiveteam github
[17:41] <arkiver> and the team
[17:44] *** Sk1d has joined #archiveteam-bs
[17:48] <arkiver> do you have an example of a large user?
[17:48] <arkiver> adinbied: ^
[17:49] <adinbied> I have it somewhere, hold on....
[17:50] <jodizzle> betamax: Did you grab candidate/campaign Instagrams as well?  There are at least a few Instagrams listed on Vote411.
[17:51] <arkiver> so the scripts look fine overall
[17:51] <adinbied> arkiver, id2/silas has ~2300 URLS in its sitemap and is one of the largest
[17:51] <arkiver> but i think they could go off to other users, since there´s not checking in the allowed function if a page is allowed since it is from a certain user
[17:52] <adinbied> Although alot of those are 404ing
[17:52] <JAA> adinbied, arkiver: Please move this to #angelonfire.
[17:52] <adinbied> Got it
[17:53] <SketchCow> anarcat: I wish you folks had metadata to go with these items.
[17:53] <SketchCow> They certainly parse as WARCs.
[17:57] <anarcat> SketchCow: which metadata would you like to see in there?
[17:57] *** Sk1d has quit IRC (Read error: Operation timed out)
[18:00] *** Sk1d has joined #archiveteam-bs
[18:13] <adinbied> Heads up, the ./get-wget-lua.sh script is also broken as it queries warriorhq.archiveteam.org 
[18:14] <adinbied> Getting a connection refused
[18:15] <JAA> Yup, same server.
[18:16] * anarcat waves at JAA 
[18:16] <anarcat> JAA: pm?
[18:25] <betamax> jodizzle: no, unless my hastily-written scraper mistook an instagram url for a campaign website
[18:25] <betamax> so maybe one or two, but I didn't specifically targe them
[18:26] <betamax> if you have any lists of instagram pages, I have a spare machine I can run snscrape on them to get the links, though
[18:40] <jodizzle> betamax: Don't have them all, but I can try doing some of my own scraping later today.  It seems like Instagram links are pretty rare.
[18:40] *** SimpBrain has joined #archiveteam-bs
[18:41] <jodizzle> Not that the candidates don't have Instagrams, but that the links aren't on Vote411 for whatever reason (I think candidates probably submit their own info to Vote411)
[18:44] <JAA> anarcat: Anytime.
[18:44] <jodizzle> From a grep of your files it doesn't seem like you picked up any
[18:45] <betamax> yeah, I think Vote411 reaches out to candidates to get info, so quite a lot are empty or incomplete
[19:46] *** odemg has joined #archiveteam-bs
[19:46] <odemg> arkiver, can we stick this on the tracker, distribute this shii, it's got a nice api https://case.law/about/
[19:47] <odemg> `40 million pages of US court decisions`
[19:47] <odemg> `currently you may view or download no more than 500 cases per day`
[19:47] <arkiver> cool
[19:47] <arkiver> we should totally get that
[19:47] <odemg> <3
[19:50] *** t2t2 has quit IRC (Remote host closed the connection)
[19:52] *** t2t2 has joined #archiveteam-bs
[20:06] <Kaz> 500/day/account, that could be an issue
[20:19] <odemg> Kaz, slow but if we get enough people/ips we're good 
[20:19] <odemg> ipv6/
[20:19] <odemg> ?
[20:20] <Kaz> yeah, adding the accounts will be a pain though, unless we just make the warriors pull in a list from somewhere
[20:20] <odemg> accounts? We dont need accounts to hit the api right?
[20:20] <odemg> doesn't seem so 
[20:22] <Kaz> https://usercontent.irccloud-cdn.com/file/AHCDUhWc/image.png
[20:22] <Kaz> not too sure what that means
[20:26] <jut> What does "bulk download" mean?
[20:26] <jut> Because it sounds like what we would like to do
[20:27] <Kaz> bulk download sounds like something we're not going to get
[20:27] <Kaz> because it sounds like they'll ask for justification
[20:27] <Kaz> and "we want to download all your data, stick it up elsewhere and distribute it ourselves" probably doesn't tick all their boxes
[20:30] <jut> Well IA has an option to set items dark doesn't it? So "we want to download all your data, stick it up elsewhere juat in case"
[20:30] <jut> *just
[20:35] *** PurpleSym sets mode: +o arkiver
[20:36] *** PurpleSym sets mode: +oooo HCross Kaz dxrt joepie91_
[20:36] *** dxrt sets mode: +o dxrt_
[20:43] *** icedice has joined #archiveteam-bs
[20:51] <betamax> so I have a new method for getting the campaign sites:
[20:51] <betamax> run wget, mirroring and creating a warc
[20:51] <betamax> if it doesn't finish in 120 seconds, cancel it, and add it to a list of 'long' sites
[20:51] <betamax> this works because the majority of sites are small and so finish in ~40 seconds
[20:58] *** anarcat has quit IRC (Read error: Connection reset by peer)
[21:03] *** icedice has quit IRC (Quit: Leaving)
[21:05] *** anarcat has joined #archiveteam-bs
[21:17] *** Ryz has joined #archiveteam-bs
[21:18] <Ryz> Huh, this might one of their A to B test results they've been pulling, I couldn't get a snapshot because I thought the hiding of search number results is permament like YouTube's
[21:20] <Ryz> I mentioned this in the main channel because it may affect how the archiving process would work if there's no quick glimpse of how large the website could be
[21:30] *** schbirid has quit IRC (Remote host closed the connection)
[21:35] *** BlueMax has joined #archiveteam-bs
[21:46] *** igglybee has joined #archiveteam-bs
[21:48] <igglybee> Is Archive-Bot going to change to SlimmerJS?
[22:04] <Flashfire> Arkiver any news on #effteepee
[22:37] <Flashfire> http://ftp.arnes.si/ftp/software/unix/perl/CPAN/modules/by-category/15_World_Wide_Web_HTML_HTTP_CGI/WWW/CLSCOTT/
[22:37] <Flashfire> Bebo is listed as lost on our wiki
[22:37] <Flashfire> I found this while scanning for FTP servers 
[22:37] <Flashfire> Anyone think its useful?
[22:38] <JAA> These particular files are also on CPAN though: https://cpan.metacpan.org/authors/id/C/CL/CLSCOTT/
[22:38] <JAA> (At least that's what it looks like; didn't check.)
[22:39] <Flashfire> It looks to be a mirror of CPAN on an FTP server but that part it beside the point. If "Version 1" is listed as lost and these are early API is data able to be extracted?
[22:40] <JAA> < Flashfire> Bebo is listed as lost on our wiki  -- I think that refers to the site, not this Perl module.
[22:41] <Flashfire> Ok then. My bad. Thought it could be useful
[22:54] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:57] *** Sk1d has joined #archiveteam-bs
[23:07] <igglybee> https://slimerjs.org/
[23:15] <JAA> igglybee: Not very likely. There are massive issues with the way PhantomJS, SlimerJS & Co. work. I'd rather spend my time developing a usable tool based on an actual browser (Firefox or Chromium Headless) than trying to replace PhantomJS with another mediocre solution. Others might disagree though. I wouldn't be opposed to accepting a PR that does all the necessary work to integrate SlimerJS (or 
[23:15] <JAA> whatever else) into wpull in general though.
[23:16] *** VerifiedJ has quit IRC (Quit: Leaving)
[23:16] <igglybee> JAA: It apparantly has the same api has phantomjs, so there should be minimal work needed
[23:16] <igglybee> *as phantoms
[23:17] <igglybee> phantomjs
[23:17] <igglybee> https://github.com/laurentj/slimerjs/blob/master/API_COMPAT.md
[23:18] <JAA> I see. Well, that would still leave those issues we had with PhantomJS, the main one being the massive duplication in the archives because all page requisites (images, stylesheets, scripts, fonts, etc.) were redownloaded for every single page.
[23:19] <igglybee> Wouldn't it still be better than nothing though?
[23:20] <igglybee> and don't browser-based things have the same issue?
[23:21] <igglybee> I'm assuming someone should be able to plop it in and it would mostly work
[23:22] <ivan> do you want to maintain it? :-)
[23:22] <JAA> Well, you can try it out by specifying SlimerJS through --phantomjs-exe. If it is indeed compatible, then it should work.
[23:25] <hook54321> I doubt I'll be able to figure it out, but if no one else wants to try it I can I guess
[23:25] <hook54321> What's the correct wpull fork to use?
[23:26] <JAA> hook54321: ArchiveTeam/wpull, but install from Git, not from PyPI.
[23:26] <JAA> 2.0.3 is currently not on PyPI.
[23:30] *** Ing3b0rg has quit IRC (Quit: woopwoop)
[23:30] *** anarcat has quit IRC (Read error: Connection reset by peer)
[23:31] *** Ing3b0rg has joined #archiveteam-bs
[23:32] *** bakJAA has quit IRC (Ping timeout: 260 seconds)
[23:32] *** plue has quit IRC (Ping timeout: 260 seconds)
[23:33] <moufu> are there page requisites phantomjs can't download, but wpull can? if not, why not just exclude -p when --phantomjs is used?
[23:33] *** betamax has quit IRC (Remote host closed the connection)
[23:34] *** betamax has joined #archiveteam-bs
[23:34] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:34] <hook54321> I don't know what that means
[23:34] *** plue has joined #archiveteam-bs
[23:35] *** bakJAA has joined #archiveteam-bs
[23:35] *** swebb sets mode: +o bakJAA
[23:35] *** JAA sets mode: +o bakJAA
[23:35] <moufu> I was referring to the duplication issue JAA mentioned
[23:35] *** anarcat has joined #archiveteam-bs
[23:36] <hook54321> JAA: I tried to install it, this was the result.
[23:36] <hook54321> https://www.irccloud.com/pastebin/5wRqVIbL/
[23:37] <JAA> moufu: I *think* that PhantomJS will download page requisites regardless of whether or not --page-requisites is specified. I also think that PhantomJS will only download the page requisites actually necessary for display, e.g. if there are multiple font formats, wpull would fetch all but PhantomJS itself wouldn't. But I don't actually know.
[23:38] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[23:38] <JAA> hook54321: pip uninstall tornado && pip install tornado==4.5.3  (with --user as appropriate; assuming you're in a venv or something)
[23:38] *** Sk1d has joined #archiveteam-bs
[23:38] <JAA> And while you're at it, make sure you have html5lib==0.9999999 (seven nines).
[23:38] <hook54321> 👀
[23:38] <hook54321> ok
[23:39] *** BlueMax has joined #archiveteam-bs
[23:40] <moufu> makes sense
[23:42] *** Odd0002 has quit IRC (Ping timeout: 260 seconds)
[23:45] *** Odd0002 has joined #archiveteam-bs
[23:48] <hook54321> https://www.irccloud.com/pastebin/jslcmSyH/
[23:49] <JAA> Oh, SlimerJS is a Python script? Hmm.
[23:49] <JAA> I guess it's not marked as executable (and is lacking the appropriate shebang maybe)?
[23:50] <Flashfire> IT WILL NEVER STOP
[23:50] <hook54321> ?
[23:54] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:55] *** anarcat has quit IRC (Read error: Operation timed out)
[23:56] <Flashfire> Oh I am adding ftp sites to the list until someone restarts the scripts
[23:56] <hook54321> JAA: What did you expect it to be?
[23:56] <hook54321> https://www.irccloud.com/pastebin/gdDkfyRq/
[23:56] <hook54321> wait
[23:56] <Flashfire> I know that ftp://ftp.aztech.com/ contains stuff cause thats how I found it but it doesnt appear indexed......
[23:56] <hook54321> https://www.irccloud.com/pastebin/3gwQ7D1o/
[23:56] <hook54321> agh, I need firefox 59
[23:56] *** anarcat has joined #archiveteam-bs
[23:57] <JAA> :-/
[23:58] <anarcat> ffs what's up with this fucking router
[23:59] <anarcat> https://forum.turris.cz/t/omnia-unreachable/8616
[23:59] *** Sk1d has joined #archiveteam-bs