#archiveteam-bs 2019-09-18,Wed

Time Nickname Message
00:00 🔗 JAA It's also possible that all of this will go away when I move away from keeping everything in memory. That's also a significant refactor though.
00:00 🔗 britmob has joined #archiveteam-bs
00:19 🔗 godane has quit IRC (Ping timeout: 252 seconds)
01:03 🔗 freemint has quit IRC (Ping timeout: 252 seconds)
01:07 🔗 freemint has joined #archiveteam-bs
01:55 🔗 freemint has quit IRC (Ping timeout: 252 seconds)
02:02 🔗 freemint has joined #archiveteam-bs
02:03 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
02:07 🔗 freemint has quit IRC (Remote host closed the connection)
02:07 🔗 freemint has joined #archiveteam-bs
02:29 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
03:00 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
03:12 🔗 freemint has quit IRC (Ping timeout: 252 seconds)
03:23 🔗 freemint has joined #archiveteam-bs
03:36 🔗 odemgi has joined #archiveteam-bs
03:37 🔗 odemg has quit IRC (Read error: Operation timed out)
03:40 🔗 RichardG has joined #archiveteam-bs
03:42 🔗 odemgi_ has quit IRC (Read error: Operation timed out)
03:45 🔗 qw3rty2 has joined #archiveteam-bs
03:51 🔗 qw3rty has quit IRC (Ping timeout: 745 seconds)
03:51 🔗 odemg has joined #archiveteam-bs
03:51 🔗 freemint has quit IRC (Remote host closed the connection)
05:15 🔗 wp494 has joined #archiveteam-bs
05:44 🔗 SketchCow archivebot thumbnailer has now been shifted to a "make thumbnails in reverse number of views order" mode
05:44 🔗 SketchCow Mostly means the archivebot collection will get thumbnails backfilling the ones without them starting from the top, going down. Going to just let that keep running to the end of the year.
05:52 🔗 godane has joined #archiveteam-bs
07:01 🔗 william has joined #archiveteam-bs
07:28 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
07:29 🔗 VADemon has joined #archiveteam-bs
07:30 🔗 slyphic has quit IRC (Read error: Operation timed out)
07:30 🔗 slyphic has joined #archiveteam-bs
07:34 🔗 Mateon1 has quit IRC (Ping timeout: 255 seconds)
07:35 🔗 Mateon1 has joined #archiveteam-bs
07:37 🔗 jspiros has joined #archiveteam-bs
07:40 🔗 HashbangI has joined #archiveteam-bs
07:56 🔗 wp494 has quit IRC (Read error: Operation timed out)
07:56 🔗 william has quit IRC (Remote host closed the connection)
07:58 🔗 slyphic has quit IRC (Read error: Operation timed out)
07:58 🔗 slyphic has joined #archiveteam-bs
08:04 🔗 VADemon_ has joined #archiveteam-bs
08:04 🔗 super3_ has quit IRC (Read error: Operation timed out)
08:05 🔗 super3_ has joined #archiveteam-bs
08:05 🔗 omglolbah has quit IRC (Read error: Operation timed out)
08:06 🔗 omglolba^ has quit IRC (Read error: Operation timed out)
08:08 🔗 omglolbah has joined #archiveteam-bs
08:09 🔗 VADemon has quit IRC (Read error: Operation timed out)
08:14 🔗 omglolba- has joined #archiveteam-bs
08:47 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
09:08 🔗 Atom__ has joined #archiveteam-bs
09:14 🔗 Atom-- has quit IRC (Read error: Operation timed out)
09:51 🔗 Mc has joined #archiveteam-bs
09:58 🔗 underscor has quit IRC (Read error: Operation timed out)
09:58 🔗 Mc hi! I'm concerned about the openclipart library contents: the website's frontpage has been "shut" for a few months and it contained tens of thousands (or hundreds of thousands) of public domain svg files. I just realized they can still be accessed at https://openclipart.org/download/<id>/ with id<318500 (318400 still works, I did not bisect further). Is there anything archive team can do to help?
09:59 🔗 underscor has joined #archiveteam-bs
10:03 🔗 a3nm has joined #archiveteam-bs
10:07 🔗 JAA Hi Mc, looks interesting. I started an ArchiveBot job for it.
10:09 🔗 a3nm nice! how does that work?
10:09 🔗 a3nm (I'm a friend/colleague of Mc and also interested in the openclipart archiving question)
10:10 🔗 JAA See https://www.archiveteam.org/index.php?title=ArchiveBot
10:10 🔗 JAA You can track progress at http://dashboard.at.ninjawedding.org/ job 3rz6ut02mt5m7eyo32n2ojxv8.
10:10 🔗 markedL April 17th is the last snapshot with a "normal" front page
10:11 🔗 a3nm JAA: ok so here you're not "crawling" but retrieving the pages sequentially
10:12 🔗 JAA a3nm: Correct, I created a list of the 318500 download URLs (https://transfer.notkiska.pw/11IEZX/openclipart.org-downloads) and fed that into AB.
10:13 🔗 a3nm awesome!
10:13 🔗 a3nm hmm, I don't understand why we are getting 301 on some URLs and 200 on others
10:13 🔗 markedL https://web.archive.org/web/20190301005023/https://openclipart.org/developers
10:13 🔗 JAA Looks like nonexistent IDs, e.g. https://openclipart.org/download/13/ , redirect to https://openclipart.org/download/177067/openclipart-horizontal.svg . That redirect isn't followed, but the SVG will be retrieved on ID 177067.
10:13 🔗 a3nm as far as I can tell, all these URLs redirect to a long URL that includes the file name
10:14 🔗 a3nm JAA: I don't think it's just that
10:14 🔗 a3nm https://openclipart.org/download/974 redirects to https://openclipart.org/download/974/jean-victor-balin-unknown-green.svg (and it exists)
10:14 🔗 JAA Yeah, so it fetches the redirect for every ID, but if the redirect target URL is in a different directory, it doesn't retrieve it due to how ArchiveBot works.
10:14 🔗 a3nm but it's listed as a 301 in the output
10:15 🔗 JAA There are two lines for 974, one is the 301 redirect, and one is the actual SVG.
10:15 🔗 JAA The dashboard always lists the initial URL, not the post-redirect one.
10:15 🔗 a3nm but https://openclipart.org/download/975 is listed as 200 OK, even though it's apparently exactly the same
10:15 🔗 a3nm ahh OK there are two lines for each
10:15 🔗 a3nm I hadn't understood it as they are not consecutive
10:15 🔗 JAA Yeah, except the ones that don't exist.
10:15 🔗 underscor has quit IRC (Read error: Connection reset by peer)
10:15 🔗 a3nm so I guess that's fine
10:16 🔗 a3nm Mc: did you look into the archival of the page with the metadata info?
10:16 🔗 a3nm *the pages
10:16 🔗 Mc nope
10:16 🔗 underscor has joined #archiveteam-bs
10:16 🔗 Mc they seem lost
10:17 🔗 a3nm ah these pages don't work?
10:17 🔗 Mc https://openclipart.org/detail/316831/ shows the landing page
10:17 🔗 Mc (versus https://web.archive.org/web/20190315015542/https://openclipart.org/detail/316831/boy-and-girl-in-love-line-art )
10:17 🔗 a3nm yeah that's annoying
10:18 🔗 a3nm it would be interesting to keep the title, description, keywords, creator, date...
10:18 🔗 JAA Looks like each clipart is also available as a PDF under https://openclipart.org/pdf/$ID/ , but I guess we don't need to grab that if we have the SVG.
10:18 🔗 a3nm ... unless that PDF has additional metadata
10:18 🔗 Mc apparently not
10:19 🔗 JAA There are also URLs like https://openclipart.org/people/rejon/openclipart-horizontal.svg but we can't guess those.
10:19 🔗 a3nm Mc: yeah, apparently not
10:19 🔗 Mc (no keywords in the pdf info, just "created by cairo" which hints at a simple command-line creation from the svg)
10:19 🔗 a3nm JAA: there's the option of crawling the URLs that Wayback Machine knows about, but it's only a small fraction (like 10k or so)
10:19 🔗 JAA Yeah, might even get converted (and cached) on demand.
10:21 🔗 JAA Wow, WMF is also available: http://www.openclipart.org/wmf/177067/
10:22 🔗 JAA a3nm: If the WBM already has them, the added value of grabbing them again is limited anyway. We could scrape search engines, Reddit, etc. though.
10:23 🔗 a3nm yeah I was more thinking of using the WBM to find URLs to crawl to retrieve stuff the WBM may not have
10:23 🔗 a3nm ah, but scraping third-party sources is a nice idea
10:23 🔗 a3nm that said, we don't expect to find metadata info there, do we?
10:23 🔗 JAA Yeah, but that doesn't work in this case because the /people/$USERNAME/ pages don't exist anymore/show the shutdown notice only.
10:23 🔗 a3nm (this could give us URLs, but what we're missing is the metadata that used to be available at these URLs)
10:23 🔗 a3nm yeah exactly
10:24 🔗 a3nm JAA: what is job c3bv2ifjrwdwmk992cchlgm72 doing ?
10:25 🔗 a3nm archiving the twitter account @openclipart?
10:25 🔗 JAA Yep
10:25 🔗 a3nm nice!
10:25 🔗 a3nm ok so I don't see other stuff we could retrieve now
10:26 🔗 a3nm maybe once these jobs have completed we email openclipart to ask them what they are up to and whether we could please get access to the metadata pages?
10:27 🔗 Mc the only time they answered someone many people thought it was impersonation
10:30 🔗 odemg has quit IRC (Leaving)
10:32 🔗 JAA If there's a chance for getting that, would be nice, yeah.
10:55 🔗 deevious has joined #archiveteam-bs
11:17 🔗 i0npulse has quit IRC (Ping timeout: 252 seconds)
11:28 🔗 i0npulse has joined #archiveteam-bs
12:03 🔗 odemgi https://the-eye.eu/oldversion.com_Sept2019.torrent
12:07 🔗 wp494 has joined #archiveteam-bs
12:46 🔗 deevious has quit IRC (Quit: deevious)
13:32 🔗 deevious has joined #archiveteam-bs
15:13 🔗 tech234a has joined #archiveteam-bs
15:25 🔗 kiska has quit IRC (Remote host closed the connection)
15:25 🔗 Flashfire has quit IRC (Remote host closed the connection)
15:26 🔗 Flashfire has joined #archiveteam-bs
15:26 🔗 kiska has joined #archiveteam-bs
15:26 🔗 Fusl____ sets mode: +o kiska
15:26 🔗 Fusl sets mode: +o kiska
15:26 🔗 Fusl_ sets mode: +o kiska
15:26 🔗 Stiletto coincidence that ft.com published an article on IA yesterday? :D https://www.ft.com/content/5be1f2ee-d60b-11e9-a0bd-ab8ec6435630
15:30 🔗 JAA I like the copyright attribution on the screenshot of their own website.
15:36 🔗 godane has quit IRC (Ping timeout: 252 seconds)
15:40 🔗 Igloo I wonder if those captures came from IA direct or if we could see the about, here :p
15:58 🔗 godane has joined #archiveteam-bs
16:26 🔗 schbirid has joined #archiveteam-bs
17:23 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
17:48 🔗 i0npulse has quit IRC (Ping timeout: 252 seconds)
18:00 🔗 i0npulse has joined #archiveteam-bs
18:57 🔗 paul2520 would it be possible to get a crawl of https://repo1.maven.org/maven2/net/snowflake/snowflake-jdbc/? has old versions of drivers
19:01 🔗 Igloo 5utqyjb1oorhxhouv48bhbfxx
19:01 🔗 Igloo Is the job id from archivebot paul2520
19:01 🔗 Igloo http://dashboard.at.ninjawedding.org/?showNicks=1
19:01 🔗 Igloo For status.
19:04 🔗 paul2520 thanks Igloo
19:04 🔗 paul2520 really appreciate it
19:06 🔗 Igloo You're welcome :)
19:12 🔗 schbirid meh, ft.com started serving 403s to me
19:15 🔗 JAA Yup, same with the AB job.
19:17 🔗 schbirid also those pages were insane, ~300 kb baseline for an article...
19:28 🔗 Jens has quit IRC (Remote host closed the connection)
19:28 🔗 Jens has joined #archiveteam-bs
19:50 🔗 SynMonger has quit IRC (Quit: Wait, what?)
19:51 🔗 SynMonger has joined #archiveteam-bs
19:53 🔗 SynMonger has quit IRC (Client Quit)
19:54 🔗 SynMonger has joined #archiveteam-bs
19:56 🔗 SynMonger has quit IRC (Client Quit)
19:57 🔗 SynMonger has joined #archiveteam-bs
19:59 🔗 SynMonger has quit IRC (Client Quit)
20:00 🔗 SynMonger has joined #archiveteam-bs
20:01 🔗 Ravenloft has joined #archiveteam-bs
21:05 🔗 wp494 has quit IRC (Ping timeout: 745 seconds)
21:05 🔗 wp494 has joined #archiveteam-bs
21:32 🔗 tech234a has joined #archiveteam-bs
21:34 🔗 DogsRNice has joined #archiveteam-bs
22:05 🔗 phillipsj Is "there are currently more than 60tn web pages" a typo?
22:05 🔗 Igloo In regards to?
22:06 🔗 phillipsj From the FT article. I suspected a fat-fingered TB, but then the grammar does not work.
22:08 🔗 phillipsj "More than 60 thousand web pages" would be technically correct territory.
22:18 🔗 Raccoon tn abbreviation for: trillion
22:19 🔗 Raccoon that's a ton of webpages.
22:20 🔗 phillipsj Thanks Raccoon
22:20 🔗 Igloo I guess it depends what they class a "web page" as
22:21 🔗 Igloo There are more than 60 trillion pages.
22:21 🔗 Igloo I would assume facebook has billions of user / group pages alone
22:21 🔗 Raccoon 99% of webpages are never visited by a human.
22:21 🔗 Igloo Twitter profile pages? etc
22:21 🔗 Igloo Again definition of a webpage? An API is a webpage. Technically.
22:28 🔗 Stiletto has quit IRC (Ping timeout: 258 seconds)
22:28 🔗 Stilett0 has joined #archiveteam-bs
22:30 🔗 Stilett0 is now known as Stiletto
22:33 🔗 HashbangI has quit IRC (Read error: Connection reset by peer)
22:38 🔗 HashbangI has joined #archiveteam-bs
22:43 🔗 BlueMax has joined #archiveteam-bs
22:53 🔗 luckcolor has quit IRC (Remote host closed the connection)
22:54 🔗 luckcolor has joined #archiveteam-bs
23:38 🔗 asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
23:38 🔗 markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
23:39 🔗 tech234a has quit IRC (Quit: Connection closed for inactivity)
23:39 🔗 asdf0101 has joined #archiveteam-bs
23:39 🔗 markedL has joined #archiveteam-bs
23:48 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:48 🔗 RichardG has joined #archiveteam-bs
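
The URL list JAA mentions at 10:12 can be reproduced with a few lines of Python. This is only a sketch under assumptions the log does not confirm: that IDs run from 1 to 318500 inclusive, and that each URL takes the trailing-slash form https://openclipart.org/download/<id>/ quoted by Mc at 09:58. The names download_urls and write_url_list are hypothetical, and this is not necessarily how the actual list was built.

```python
from typing import Iterator

# URL pattern quoted in the log at 09:58.
BASE = "https://openclipart.org/download"


def download_urls(max_id: int = 318500, base: str = BASE) -> Iterator[str]:
    """Yield one download URL per ID (assumption: IDs are 1..max_id inclusive)."""
    for clip_id in range(1, max_id + 1):
        yield f"{base}/{clip_id}/"


def write_url_list(path: str, max_id: int = 318500) -> int:
    """Write the URLs to a text file, one per line, and return the count written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for url in download_urls(max_id):
            fh.write(url + "\n")
            count += 1
    return count
```

Calling write_url_list("openclipart.org-downloads") would produce a one-URL-per-line file that can be handed to a crawler. IDs that do not exist are harmless in such a list: as JAA notes at 10:13, they simply redirect to the openclipart-horizontal.svg placeholder.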
