[00:07] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[00:14] *** Zerote_ has joined #archiveteam-bs
[00:14] *** Zerote_ has quit IRC (Read error: Connection reset by peer)
[00:15] *** Zerote_ has joined #archiveteam-bs
[00:15] *** Zerote has quit IRC (Leaving)
[00:16] *** britmob has joined #archiveteam-bs
[00:33] *** i0npulse has joined #archiveteam-bs
[00:35] would it be possible to get archivebot to crawl https://en.wikipedia.org/wiki/Timeline_of_the_2019_Turkish_offensive_into_north-eastern_Syria? I was hoping the IA save-page-now would crawl the references, but it didn't...
[00:52] *** LowLevelM has joined #archiveteam-bs
[00:52] betamax: Thanks, there is probably enough warriors running urlteam right now
[00:53] well, we could always do with more people completing CAPTCHAs to help save Yahoo Groups (shutting down in December) : https://github.com/davidferguson/yahoogroups-joiner
[00:53] disclaimer: I'm helping coordinate that project
[00:55] Sounds cool, I will do that.
[00:55] Thanks
[00:55] channel is #yahoosucks
[00:56] and there's a leaderboard: http://tinyurl.com/ygleaders
[00:56] paul2520: You can run that page with !ao, and that should grab all the outgoing links.
[00:57] Also, IA save-page-now can do it now, I think, but you might have to select some tick boxes before saving the page.
[01:12] *** Raccoon has quit IRC (Remote host closed the connection)
[01:21] jodizzle: !ao doesn't grab any links.
[01:24] Oh. Does it just get page requisites then?
[01:25] Yes
[01:26] You'll want to create a list of the outlinks and run those with !ao <.
[01:38] does wikiteam provide better coverage of wikipedia?
[01:40] paul2520: I threw the citation links in as an '!ao <' job
[01:53] There are regular dumps of all Wikimedia projects which are also uploaded to IA by someone on here I believe. And all Wikipedia outlinks get archived continuously as they're added to articles by IA.
[01:53] Or at least that used to be the case; haven't checked it in a while.
[01:54] I assume it was one of you, but that page is in the wayback machine now.
[02:07] *** LowLevelM has quit IRC (Ping timeout: 260 seconds)
[02:58] *** Raccoon has joined #archiveteam-bs
[03:25] *** Ivy has joined #archiveteam-bs
[03:26] *** manjaro-u has quit IRC (Ping timeout: 252 seconds)
[03:28] thanks jodizzle... the save-page-now has "all outlinks" as a checkbox, but I noticed many of the URLs in the references didn't appear even after everything seemed finished
[03:29] I appreciate you running the !ao... I was under the impression I do not have privileges to actually kick-off archivebot jobs myself. Should I try !ao for small, one-off requests like this in the future?
[03:29] ...or is there a way to get archivebot privileges?
[03:34] paul2520: You should be able to use '!ao' and '!ao <' without special privileges
[03:35] Using '!a' does require special privileges
[03:37] Someone with #archivebot ops would need to decide to give those privileges to you
[04:05] *** bluefoo has quit IRC (Ping timeout: 252 seconds)
[04:11] *** bluefoo has joined #archiveteam-bs
[04:36] *** qw3rty has joined #archiveteam-bs
[04:41] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[04:46] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[05:19] *** HP_Archiv has joined #archiveteam-bs
[05:19] Heya
[05:19] Anyone here?
[05:47] so if anyone has twitter show you no images in firefox you just have to clean your cache history
[06:31] *** bluefoo has quit IRC (Read error: Connection reset by peer)
[06:32] *** killsushi has quit IRC (Quit: Leaving)
[06:33] *** bluefoo has joined #archiveteam-bs
[06:43] *** bluefoo has quit IRC (Quit: bluefoo)
[06:49] *** bluefoo has joined #archiveteam-bs
[07:48] *** bluefoo has quit IRC (Quit: bluefoo)
[07:51] *** bluefoo has joined #archiveteam-bs
[08:23] *** bluefoo has quit IRC (Ping timeout: 252 seconds)
[08:35] *** Ivy has quit IRC (Quit: Connection closed for inactivity)
[09:19] *** bluefoo has joined #archiveteam-bs
[09:28] *** bluefoo has quit IRC (Ping timeout: 255 seconds)
[09:29] *** bluefoo has joined #archiveteam-bs
[10:00] *** HP_Archiv has quit IRC (Quit: Page closed)
[10:01] *** HP_Archiv has joined #archiveteam-bs
[11:02] *** bluefoo has quit IRC (Read error: Operation timed out)
[11:20] *** HP_Archiv has quit IRC (Ping timeout: 260 seconds)
[11:24] *** bluefoo has joined #archiveteam-bs
[11:45] *** Smiley has quit IRC (Read error: Operation timed out)
[11:48] *** Smiley has joined #archiveteam-bs
[12:31] *** Smiley has quit IRC (Ping timeout: 496 seconds)
[12:45] *** Smiley has joined #archiveteam-bs
[13:33] *** BlueMax has quit IRC (Quit: Leaving)
[15:04] JAA: Agreed, and they're going to move.
[15:04] I just needed to get them into the system and functioning.
[15:09] *** schbirid has joined #archiveteam-bs
[15:16] *** Stilettoo has joined #archiveteam-bs
[15:17] *** Ravenloft has joined #archiveteam-bs
[15:24] *** manjaro-u has joined #archiveteam-bs
[15:26] *** Stiletto has quit IRC (Ping timeout: 745 seconds)
[15:36] *** odemgi has quit IRC (Remote host closed the connection)
[15:38] Looks like Archivebot backlog on FOS has diminished
[15:51] *** Stiletto has joined #archiveteam-bs
[15:51] *** Stilettoo has quit IRC (Read error: Operation timed out)
[16:15] And is now gone.
[16:15] Also, apparently my archivebot screenshotter service got rebooted two weeks ago.
[16:17] Olympics items are now being moved out of archivebot, and into archiveteam_inbox, where they belong.
[16:17] Didn't follow my own rules!
[16:17] And, the added archivebot items through alternate pipelines in inbox are now being automatically redirected to archivebot.
[16:18] Sweet, thanks.
[16:19] I'll be creating collections/redirections for the inbox items that need a home.
[16:20] The screenshots in the archivebot are looking really sweet
[16:21] I have a fundamental question as to whether the way I currently screenshot the items, the time it takes, can keep up with the speed at which new items arrive.
[16:21] Currently 7-10 arrive. I think that's more than this thing can do in a day.
[16:22] We could always do it at upload SketchCow
[16:23] What, generate screenshots? It seems like a needless burden
[16:23] I'll put it this way
[16:24] If my thing sees something has screenshots, it skips and moves on. If someone is generating them and uploading them, then my thing won't do its work.
[16:24] I would call it an "archivebot 3.0 nicety"
[16:25] Like, if people are being archivebot pipelines, and then they run what I'm running to generate the screenshots (and I'm doing a pretty hacky thing) then include them... great.
[16:25] I just don't want to add more friction to an already 10% fragile process
[16:26] My fun little post-processing of items comes way after everyone is already using the data.
[16:27] Makes sense, Hard for anyone to do it outside of IA's network due to needing to download the WARC
[16:28] And the delays that come with it
[16:28] Yeah.
[16:28] I am downloading 5gb WARC sets to do this
[16:28] To generate a single screenshot
[16:28] Take that, carbon footprint
[16:32] I wrote a script to generate a list of the top 20-30 pages of the archivebot collection and do the items in there first, so the collection already looks pretty nice for 99% of the manual browsings that will happen in it.
[16:32] Brewster was happy, I was happy.
[16:34] *** manjaro-u has quit IRC (Quit: Konversation terminated!)
[16:36] https://archive.org/details/archiveteam_2018_olympics
[16:43] https://archive.org/details/archiveteam_24syv
[16:45] *** eientei95 has quit IRC (Read error: Connection reset by peer)
[16:45] *** eientei95 has joined #archiveteam-bs
[16:50] *** Ravenloft has quit IRC (Read error: Operation timed out)
[16:51] *** mistym- has joined #archiveteam-bs
[16:53] *** mistym has quit IRC (Ping timeout: 745 seconds)
[16:55] https://archive.org/details/archiveteam_yourshot
[17:26] *** DogsRNice has joined #archiveteam-bs
[17:32] *** apache2 has quit IRC (Remote host closed the connection)
[17:32] *** Mateon1 has quit IRC (Write error: Broken pipe)
[17:32] *** Mateon1 has joined #archiveteam-bs
[17:32] *** apache2 has joined #archiveteam-bs
[17:36] https://archive.org/search.php?query=mediatype%3Acollection%20description%3A%2Aforthcoming%2A
[17:36] All of these are collections I need to add descriptions to
[18:07] *** systwi_ is now known as systwi
[18:30] *** X-Scale` has joined #archiveteam-bs
[18:32] *** X-Scale has quit IRC (Ping timeout: 252 seconds)
[18:32] *** X-Scale` is now known as X-Scale
[18:44] *** odemgi has joined #archiveteam-bs
[18:59] what's the background history on "WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD"
[19:00] btw, duckduckgo gives a few archiveteam results, then launches into some russian cp comment, before arriving at shakespeare :)
[19:06] *** manjaro-u has joined #archiveteam-bs
[19:22] *** sotty has joined #archiveteam-bs
[19:56] *** X-Scale` has joined #archiveteam-bs
[19:57] *** X-Scale has quit IRC (Ping timeout: 252 seconds)
[19:57] *** X-Scale` is now known as X-Scale
[20:10] *** legoktm has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[20:11] *** legoktm has joined #archiveteam-bs
[20:22] *** schbirid has quit IRC (Quit: Leaving)
[21:15] *** icedice has quit IRC (Ping timeout: 252 seconds)
[21:21] *** wyatt8740 has joined #archiveteam-bs
[21:29] *** icedice has joined #archiveteam-bs
[22:10] *** Panasonic has joined #archiveteam-bs
[22:36] thanks jodizzle
[22:47] what did you do to get the https://transfer.notkiska.pw/oPAdn/wikipedia-Timeline_of_the_2019_Turkish_offensive_into_north-eastern_Syria-citations.txt file?
[22:51] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:54] Gah, why are government sites always so awful?
[22:55] Looking into https://www.fbo.gov/ currently. It does POST requests, stores the search parameters in a session store I assume and then does pagination with GET and cookies. :-|
[22:56] It's being migrated to an SPA site. Not sure what's wrose.
[22:56] worse*
[22:57] Yeah, going through government websites is like taking a trip through different web design patterns.
[22:57] Probably a consequence of cheap contracting, in a lot of cases.
[22:57] Some of them are pretty nice though. No JS, lightweight
[22:58] paul2520: I did some work in a Python shell to request that Wikipedia page and fetch the citation links with CSS selectors.
[22:59] I'll be grabbing FBO with qwarc shortly, and it'll be a pain.
[22:59] I wonder how well the pagination of 3 million results will work.
[23:08] that sounds great jodizzle -- feel like putting it in a gist?
[23:10] paul2520: Sure, I can try and dig up what I did later.
[23:26] *** BlueMax has joined #archiveteam-bs
[23:56] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[23:56] *** BartoCH has quit IRC (Remote host closed the connection)
[23:59] *** RichardG has joined #archiveteam-bs
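
Note on the citation-link list discussed at 22:47 and 22:58: the script jodizzle actually used is not in this log. The following is only a rough sketch of the same idea, under stated assumptions. It uses Python's standard-library HTMLParser rather than CSS selectors, and approximates the selector "ol.references a.external". The class names "references" and "external" are assumptions about Wikipedia's rendered markup, and CitationLinkParser, citation_links and SAMPLE are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class CitationLinkParser(HTMLParser):
    """Collect hrefs of external links found inside <ol class="references">.

    Assumption: reference lists are rendered as <ol class="references"> and
    outbound links carry the class "external".
    """

    def __init__(self):
        super().__init__()
        self.ol_depth = 0  # greater than 0 while inside a references list
        self.links = []    # unique hrefs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "ol":
            # Enter a references list, or track an <ol> nested inside one.
            if self.ol_depth or "references" in classes:
                self.ol_depth += 1
        elif tag == "a" and self.ol_depth and "external" in classes:
            href = attrs.get("href")
            if href:
                if href.startswith("//"):
                    # Make protocol-relative links absolute.
                    href = "https:" + href
                if href not in self.links:
                    self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "ol" and self.ol_depth:
            self.ol_depth -= 1


def citation_links(html):
    """Return the external links found in the reference list(s) of html."""
    parser = CitationLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical stand-in for a fetched article page.
SAMPLE = """
<p><a class="external text" href="https://example.com/body">body link</a></p>
<ol class="references">
  <li id="cite_note-1"><a href="#cite_ref-1">^</a>
    <cite><a rel="nofollow" class="external text" href="https://example.org/a">A</a></cite></li>
  <li id="cite_note-2"><cite><a class="external text" href="//example.net/b">B</a></cite></li>
</ol>
"""

if __name__ == "__main__":
    # One URL per line.
    print("\n".join(citation_links(SAMPLE)))
```

For a real page, the HTML would first have to be fetched (for example with urllib.request) and passed to citation_links. The resulting one-URL-per-line text file, once uploaded somewhere, is the kind of list the '!ao <' jobs mentioned earlier in the log take as input.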