[00:00] It's also possible that all of this will go away when I move away from keeping everything in memory. That's also a significant refactor though.
[00:00] *** britmob has joined #archiveteam-bs
[00:19] *** godane has quit IRC (Ping timeout: 252 seconds)
[01:03] *** freemint has quit IRC (Ping timeout: 252 seconds)
[01:07] *** freemint has joined #archiveteam-bs
[01:55] *** freemint has quit IRC (Ping timeout: 252 seconds)
[02:02] *** freemint has joined #archiveteam-bs
[02:03] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[02:07] *** freemint has quit IRC (Remote host closed the connection)
[02:07] *** freemint has joined #archiveteam-bs
[02:29] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:00] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[03:12] *** freemint has quit IRC (Ping timeout: 252 seconds)
[03:23] *** freemint has joined #archiveteam-bs
[03:36] *** odemgi has joined #archiveteam-bs
[03:37] *** odemg has quit IRC (Read error: Operation timed out)
[03:40] *** RichardG has joined #archiveteam-bs
[03:42] *** odemgi_ has quit IRC (Read error: Operation timed out)
[03:45] *** qw3rty2 has joined #archiveteam-bs
[03:51] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[03:51] *** odemg has joined #archiveteam-bs
[03:51] *** freemint has quit IRC (Remote host closed the connection)
[05:15] *** wp494 has joined #archiveteam-bs
[05:44] archivebot thumbnailer has now been shifted to a "make thumbnails in reverse number of views order"
[05:44] Mostly means the archivebot collection will get thumbnails backfilling the ones without them starting from the top, going down. Going to just let that keep running to the end of the year.
[05:52] *** godane has joined #archiveteam-bs
[07:01] *** william has joined #archiveteam-bs
[07:28] *** VADemon has quit IRC (Read error: Connection reset by peer)
[07:29] *** VADemon has joined #archiveteam-bs
[07:30] *** slyphic has quit IRC (Read error: Operation timed out)
[07:30] *** slyphic has joined #archiveteam-bs
[07:34] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[07:35] *** Mateon1 has joined #archiveteam-bs
[07:37] *** jspiros has joined #archiveteam-bs
[07:40] *** HashbangI has joined #archiveteam-bs
[07:56] *** wp494 has quit IRC (Read error: Operation timed out)
[07:56] *** william has quit IRC (Remote host closed the connection)
[07:58] *** slyphic has quit IRC (Read error: Operation timed out)
[07:58] *** slyphic has joined #archiveteam-bs
[08:04] *** VADemon_ has joined #archiveteam-bs
[08:04] *** super3_ has quit IRC (Read error: Operation timed out)
[08:05] *** super3_ has joined #archiveteam-bs
[08:05] *** omglolbah has quit IRC (Read error: Operation timed out)
[08:06] *** omglolba^ has quit IRC (Read error: Operation timed out)
[08:08] *** omglolbah has joined #archiveteam-bs
[08:09] *** VADemon has quit IRC (Read error: Operation timed out)
[08:14] *** omglolba- has joined #archiveteam-bs
[08:47] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[09:08] *** Atom__ has joined #archiveteam-bs
[09:14] *** Atom-- has quit IRC (Read error: Operation timed out)
[09:51] *** Mc has joined #archiveteam-bs
[09:58] *** underscor has quit IRC (Read error: Operation timed out)
[09:58] hi! I'm concerned about openclipart library contents: the website's frontpage has been "shut" for a few months and it contained tens of thousands (or hundreds of thousands) of public domain svg files. I just realized they can still be accessed by https://openclipart.org/download// with id<318500 (318400 still works, I did not bisect further). Is there anything archive team can do to help?
[09:59] *** underscor has joined #archiveteam-bs
[10:03] *** a3nm has joined #archiveteam-bs
[10:07] Hi Mc, looks interesting. I started an ArchiveBot job for it.
[10:09] nice! how does that work?
[10:09] (I'm a friend/colleague of Mc and also interested in the openclipart archiving question)
[10:10] See https://www.archiveteam.org/index.php?title=ArchiveBot
[10:10] You can track progress at http://dashboard.at.ninjawedding.org/ job 3rz6ut02mt5m7eyo32n2ojxv8.
[10:10] April 17th is the last snapshot with a "normal" front page
[10:11] JAA: ok so here you're not "crawling" but retrieving the pages sequentially
[10:12] a3nm: Correct, I created a list of the 318500 download URLs (https://transfer.notkiska.pw/11IEZX/openclipart.org-downloads) and fed that into AB.
[10:13] awesome!
[10:13] hmm, I don't understand why we are getting 301 on some URLs and 200 on others
[10:13] https://web.archive.org/web/20190301005023/https://openclipart.org/developers
[10:13] Looks like inexistent IDs, e.g. https://openclipart.org/download/13/ , redirect to https://openclipart.org/download/177067/openclipart-horizontal.svg . That redirect isn't followed, but the SVG will be retrieved on ID 177067.
[10:13] as far as I can tell, all these URLs redirect to a long URL that includes the file name
[10:14] JAA: I don't think it's just that
[10:14] https://openclipart.org/download/974 redirects to https://openclipart.org/download/974/jean-victor-balin-unknown-green.svg (and it exists)
[10:14] Yeah, so it fetches the redirect for every ID, but if the redirect target URL is in a different directory, it doesn't retrieve it due to how ArchiveBot works.
[10:14] but it's listed as a 301 in the output
[10:15] There are two lines for 974, one is the 301 redirect, and one is the actual SVG.
[10:15] The dashboard always lists the initial URL, not the post-redirect one.
[10:15] but https://openclipart.org/download/975 is listed as 200 OK, even though it's apparently exactly the same
[10:15] ahh OK there are two lines for each
[10:15] I hadn't understood it as they are not consecutive
[10:15] Yeah, except the ones that don't exist.
[10:15] *** underscor has quit IRC (Read error: Connection reset by peer)
[10:15] so I guess that's fine
[10:16] Mc: did you look into the archival of the page with the metadata info?
[10:16] *the pages
[10:16] nope
[10:16] *** underscor has joined #archiveteam-bs
[10:16] they seem lost
[10:17] ah these pages don't work?
[10:17] https://openclipart.org/detail/316831/ shows the landing page
[10:17] (versus https://web.archive.org/web/20190315015542/https://openclipart.org/detail/316831/boy-and-girl-in-love-line-art )
[10:17] yeah that's annoying
[10:18] it would be interesting to keep the title, description, keywords, creator, date...
[10:18] Looks like each clipart is also available as a PDF under https://openclipart.org/pdf/$ID/ , but I guess we don't need to grab that if we have the SVG.
[10:18] ... unless that PDF has additional metadata
[10:18] apparently not
[10:19] There are also URLs like https://openclipart.org/people/rejon/openclipart-horizontal.svg but we can't guess those.
[10:19] Mc: yeah, apparently not
[10:19] (no keywords in the pdf info, just "created by cairo" which hints at a simple command-line creation from the svg)
[10:19] JAA: there's the option of crawling the URLs that Wayback Machine knows about, but it's only a small fraction (like 10k or so)
[10:19] Yeah, might even get converted (and cached) on demand.
[10:21] Wow, WMF is also available: http://www.openclipart.org/wmf/177067/
[10:22] a3nm: If the WBM already has them, the added value of grabbing them again is limited anyway. We could scrape search engines, Reddit, etc. though.
[10:23] yeah I was more thinking of using the WBM to find URLs to crawl to retrieve stuff the WBM may not have
[10:23] ah, but scraping third-party sources is a nice idea
[10:23] that said, we don't expect to find metadata info there, do we?
[10:23] Yeah, but that doesn't work in this case because the /people/$USERNAME/ pages don't exist anymore/show the shutdown notice only.
[10:23] (this could give us URLs, but what we're missing is the metadata that used to be available at these URLs)
[10:23] yeah exactly
[10:24] JAA: what is job c3bv2ifjrwdwmk992cchlgm72 doing?
[10:25] archiving the twitter account @openclipart?
[10:25] Yep
[10:25] nice!
[10:25] ok so I don't see other stuff we could retrieve now
[10:26] maybe once these jobs have completed we email openclipart to ask them what they are up to and could we please get access to the metadata pages?
[10:27] the only time they answered someone many people thought it was impersonation
[10:30] *** odemg has quit IRC (Leaving)
[10:32] If there's a chance for getting that, would be nice, yeah.
[10:55] *** deevious has joined #archiveteam-bs
[11:17] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[11:28] *** i0npulse has joined #archiveteam-bs
[12:03] https://the-eye.eu/oldversion.com_Sept2019.torrent
[12:07] *** wp494 has joined #archiveteam-bs
[12:46] *** deevious has quit IRC (Quit: deevious)
[13:32] *** deevious has joined #archiveteam-bs
[15:13] *** tech234a has joined #archiveteam-bs
[15:25] *** kiska has quit IRC (Remote host closed the connection)
[15:25] *** Flashfire has quit IRC (Remote host closed the connection)
[15:26] *** Flashfire has joined #archiveteam-bs
[15:26] *** kiska has joined #archiveteam-bs
[15:26] *** Fusl____ sets mode: +o kiska
[15:26] *** Fusl sets mode: +o kiska
[15:26] *** Fusl_ sets mode: +o kiska
[15:26] coincidence that ft.com published an article on IA yesterday?
:D https://www.ft.com/content/5be1f2ee-d60b-11e9-a0bd-ab8ec6435630
[15:30] I like the copyright attribution on the screenshot of their own website.
[15:36] *** godane has quit IRC (Ping timeout: 252 seconds)
[15:40] I wonder if those captures came from IA direct or if we could see the about, here :p
[15:58] *** godane has joined #archiveteam-bs
[16:26] *** schbirid has joined #archiveteam-bs
[17:23] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[17:48] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[18:00] *** i0npulse has joined #archiveteam-bs
[18:57] would it be possible to get a crawl of https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/? has old versions of drivers
[19:01] 5utqyjb1oorhxhouv48bhbfxx
[19:01] Is the job id from archivebot paul2520
[19:01] http://dashboard.at.ninjawedding.org/?showNicks=1
[19:01] For status.
[19:04] thanks Igloo
[19:04] really appreciate it
[19:06] You're welcome :)
[19:12] meh, ft.com started serving 403s to me
[19:15] Yup, same with the AB job.
[19:17] also those pages were insane, ~300 kb baseline for an article...
[19:28] *** Jens has quit IRC (Remote host closed the connection)
[19:28] *** Jens has joined #archiveteam-bs
[19:50] *** SynMonger has quit IRC (Quit: Wait, what?)
[19:51] *** SynMonger has joined #archiveteam-bs
[19:53] *** SynMonger has quit IRC (Client Quit)
[19:54] *** SynMonger has joined #archiveteam-bs
[19:56] *** SynMonger has quit IRC (Client Quit)
[19:57] *** SynMonger has joined #archiveteam-bs
[19:59] *** SynMonger has quit IRC (Client Quit)
[20:00] *** SynMonger has joined #archiveteam-bs
[20:01] *** Ravenloft has joined #archiveteam-bs
[21:05] *** wp494 has quit IRC (Ping timeout: 745 seconds)
[21:05] *** wp494 has joined #archiveteam-bs
[21:32] *** tech234a has joined #archiveteam-bs
[21:34] *** DogsRNice has joined #archiveteam-bs
[22:05] Is "there are currently more than 60tn web pages" a typo?
[22:05] In regards to?
[22:06] From the FT article.
suspect the fat-finger TB, but then the grammar does not work.
[22:08] "More than 60 thousand web pages" would be technically correct territory.
[22:18] tn abbreviation for: trillion
[22:19] that's a ton of webpages.
[22:20] Thanks Raccoon
[22:20] I guess it depends what they class a "web page" as
[22:21] There are more than 60 trillion pages.
[22:21] I would assume facebook has billions of user / group pages alone
[22:21] 99% of webpages are never visited by a human.
[22:21] Twitter profile pages? etc
[22:21] Again definition of a webpage? An API is a webpage. Technically.
[22:28] *** Stiletto has quit IRC (Ping timeout: 258 seconds)
[22:28] *** Stilett0 has joined #archiveteam-bs
[22:30] *** Stilett0 is now known as Stiletto
[22:33] *** HashbangI has quit IRC (Read error: Connection reset by peer)
[22:38] *** HashbangI has joined #archiveteam-bs
[22:43] *** BlueMax has joined #archiveteam-bs
[22:53] *** luckcolor has quit IRC (Remote host closed the connection)
[22:54] *** luckcolor has joined #archiveteam-bs
[23:38] *** asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
[23:38] *** markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
[23:39] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[23:39] *** asdf0101 has joined #archiveteam-bs
[23:39] *** markedL has joined #archiveteam-bs
[23:48] *** RichardG has quit IRC (Read error: Connection reset by peer)
[23:48] *** RichardG has joined #archiveteam-bs
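
Note on the [10:12] message: the log does not show how the list of 318500 download URLs was actually produced. A minimal sketch of one way to build such a list, assuming IDs run from 1 to 318500 and using the trailing-slash URL form quoted at [10:13]; the names `download_urls` and `write_url_list` are hypothetical:

```python
from typing import Iterator

# URL pattern as quoted in the log (e.g. https://openclipart.org/download/13/).
# Whether the real list used a trailing slash is an assumption.
BASE = "https://openclipart.org/download/{id}/"


def download_urls(max_id: int, start: int = 1) -> Iterator[str]:
    """Yield one download URL per ID in [start, max_id] (start at 1 is an assumption)."""
    for clip_id in range(start, max_id + 1):
        yield BASE.format(id=clip_id)


def write_url_list(path: str, max_id: int) -> int:
    """Write one URL per line to `path` and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for url in download_urls(max_id):
            fh.write(url + "\n")
            count += 1
    return count
```

Calling `write_url_list("openclipart.org-downloads", 318500)` would produce a plain-text file with one URL per line. As discussed at [10:13] to [10:15], IDs that do not exist redirect elsewhere, so a list built this way deliberately includes them and leaves it to the crawler to record the 301.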