#archiveteam-ot 2019-11-08,Fri

Time Nickname Message
00:03 🔗 schbirid has quit IRC (Quit: Leaving)
00:10 🔗 apache2 has joined #archiveteam-ot
00:37 🔗 robogoat_ has quit IRC (Ping timeout: 258 seconds)
00:37 🔗 robogoat has joined #archiveteam-ot
01:03 🔗 ivan Mateon1: I don't get it, why does node need that much stuff in memory for crawling annotations :-)
01:04 🔗 ivan how was it stored on disk?
01:26 🔗 yawkat has quit IRC (Ping timeout: 246 seconds)
01:35 🔗 yawkat has joined #archiveteam-ot
02:30 🔗 Raccoon yano: This would interest you. It's about the relationship of public libraries and ebooks, and new lending restrictions on the horizon. https://www.eff.org/deeplinks/2019/11/publishers-should-be-making-e-book-licensing-better-not-worse
02:31 🔗 Raccoon Publishers are trying to erase library expectations of 'first sale doctrine' behavior / use in the e-book realm.
02:48 🔗 yano Raccoon: yeah, i saw that :-\
02:50 🔗 Raccoon Really waiting for them to overstep, like Disney ending licensing of old films from the Fox catalog, like The Day The Earth Stood Still.
02:51 🔗 Raccoon That act should immediately convert the copyright to public domain
02:53 🔗 ShellyRol has quit IRC (Read error: Connection reset by peer)
02:55 🔗 ShellyRol has joined #archiveteam-ot
03:14 🔗 kiskabak has quit IRC (Ping timeout (120 seconds))
03:15 🔗 kiskabak has joined #archiveteam-ot
03:15 🔗 Fusl sets mode: +o kiskabak
03:15 🔗 Fusl__ sets mode: +o kiskabak
03:15 🔗 Fusl_ sets mode: +o kiskabak
03:39 🔗 m007a83 has joined #archiveteam-ot
03:46 🔗 manjaro-u has quit IRC (Read error: Operation timed out)
03:53 🔗 BlueMax has joined #archiveteam-ot
04:39 🔗 manjaro-u has joined #archiveteam-ot
04:40 🔗 qw3rty has joined #archiveteam-ot
04:49 🔗 qw3rty2 has quit IRC (Ping timeout: 745 seconds)
04:51 🔗 manjaro-u has quit IRC (Quit: Konversation terminated!)
05:38 🔗 manjaro-u has joined #archiveteam-ot
05:50 🔗 nataraj has joined #archiveteam-ot
05:57 🔗 manjaro-u has quit IRC (Quit: Konversation terminated!)
06:00 🔗 nataraj has quit IRC (Read error: Operation timed out)
06:05 🔗 nataraj has joined #archiveteam-ot
06:17 🔗 Mateon1 ivan: Sorry, I missed that as I was asleep already. I need a fast way to check whether a page was visited before queuing that page. At first I just stored the video IDs, playlist IDs and user IDs as strings in a Set(), then when I ran into GC issues I decoded the string IDs into a 64-bit integer, and stored that.
06:19 🔗 Mateon1 For the annotation crawler most of this was done server side, and there was only a small set of cached IDs not to recrawl. With this one, there is no server, I'm just dumping lines of text into a TSV file in append mode for later processing.
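The approach Mateon1 describes (integer keys in an in-memory set, results appended to a TSV file) might look roughly like the sketch below. It is written in Python rather than node and assumes IDs are 11-character base64url strings that encode exactly 64 bits, which may not hold for playlist or user IDs. The names `id_to_int`, `int_to_id` and `record` are hypothetical and not taken from the actual crawler.

```python
import base64

visited = set()  # in-memory set of IDs already seen, stored as integers


def id_to_int(item_id):
    """Decode an 11-character base64url ID into an unsigned 64-bit integer.

    Assumption: the ID encodes exactly 8 bytes (11 chars plus one '=' pad).
    """
    if len(item_id) != 11:
        raise ValueError("expected an 11-character ID")
    raw = base64.urlsafe_b64decode(item_id + "=")  # 12 chars -> 8 bytes
    return int.from_bytes(raw, "big")


def int_to_id(value):
    """Inverse of id_to_int, for turning a stored integer back into an ID."""
    raw = value.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def record(tsv_path, item_id, fields):
    """Append one TSV line for item_id unless it was already recorded.

    Returns True if a line was written, False if the ID was seen before.
    """
    key = id_to_int(item_id)
    if key in visited:
        return False
    visited.add(key)
    # Append mode: earlier lines are never rewritten.
    with open(tsv_path, "a", encoding="utf-8") as out:
        out.write("\t".join([item_id, *fields]) + "\n")
    return True
```

Storing integers instead of strings reduces the number of heap-allocated string objects, which is presumably what eased the GC pressure; the set itself still has to fit in memory.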
06:19 🔗 manjaro-u has joined #archiveteam-ot
06:20 🔗 ivan Mateon1: ah, I guess rocksdb might be a better place to store such a set
06:24 🔗 nataraj has quit IRC (Quit: Konversation terminated!)
06:32 🔗 Mateon1 I just took a look, but I have no idea how to use that from an application, the C++ bindings are quite a mess, and I'd prefer to avoid Java if possible. I need to rethink the problem... and I'll probably end up reinventing the database, or something
06:32 🔗 Ryz I just tried to replicate what Raccoon described, to make sure it's not only 'em; yep, with uBlock active, I was unable to download https://cdn4.vectorstock.com/i/1000x1000/00/93/of-thief-vector-23180093.jpg (though I can still download it despite that supposed block by dragging the image from the window onto the desktop).
06:32 🔗 Ryz With uBlock disabled, I was able to download the image normally
06:32 🔗 dhyan_nat has joined #archiveteam-ot
06:34 🔗 Raccoon Neat, I didn't know Chrome did drag-drop to the desktop.
06:37 🔗 Ryz Yeah, I usually save images that way because I'm too lazy to right-click the image, click "Save image as...", and then either browse for a folder to save the image in, or just skip that, leave it at the default download location and confirm with "Save"
06:39 🔗 Ryz Speaking of drag and drop, there's one thing Firefox has that Google Chrome appears never to have implemented (without an extension such as Tab-Snap), most likely because of what's basically 'dummy-proofing' in the eyes of the developers: being able to open multiple links by dragging text onto the window
06:40 🔗 Ryz I can drag one link onto Google Chrome, but not multiple at all
06:50 🔗 kiska has quit IRC (Remote host closed the connection)
06:50 🔗 Flashfire has quit IRC (Remote host closed the connection)
06:51 🔗 kiska has joined #archiveteam-ot
06:51 🔗 Flashfire has joined #archiveteam-ot
06:51 🔗 Fusl__ sets mode: +o kiska
06:51 🔗 Fusl sets mode: +o kiska
06:51 🔗 Fusl_ sets mode: +o kiska
07:42 🔗 ivan Mateon1: people have bindings to it for node of course
07:42 🔗 ivan and Rust, and a bunch of others
07:56 🔗 ivan compared to using postgresql you get compression and more transactions per second while losing multi-user and interesting querying capabilities, so it can make sense for applications where you have a single process touching it
08:11 🔗 Mateon1 I'll look into it, thanks for the suggestion. I'll have free time for experimenting on the weekend.
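As a rough illustration of the idea ivan suggests (keeping the visited set in an embedded, single-process key-value store instead of in RAM), here is a minimal sketch. It does not use RocksDB: Python's standard-library `dbm` module is swapped in as a stand-in, so none of the compression or throughput properties mentioned above apply. The helper name `mark_visited` and the 8-byte key layout are assumptions for illustration.

```python
import dbm
import struct


def mark_visited(db, numeric_id):
    """Record numeric_id in an on-disk key-value store.

    Returns True if the ID was new, False if it was already present.
    The key is the ID packed as a big-endian unsigned 64-bit integer.
    """
    key = struct.pack(">Q", numeric_id)
    if key in db:
        return False
    db[key] = b"1"  # the value is unused; only key presence matters
    return True
```

A caller would open the store once with `dbm.open(path, "c")` and pass the handle to `mark_visited` for every candidate ID before queuing it.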
09:10 🔗 Raccoon has quit IRC (Ping timeout: 612 seconds)
09:17 🔗 lunik1 has quit IRC (Read error: Connection reset by peer)
09:17 🔗 lunik1 has joined #archiveteam-ot
09:25 🔗 Raccoon has joined #archiveteam-ot
09:48 🔗 manjaro-u has quit IRC (Quit: Konversation terminated!)
10:00 🔗 manjaro-u has joined #archiveteam-ot
10:06 🔗 HP_Archiv has quit IRC (Ping timeout: 263 seconds)
10:14 🔗 manjaro-u has quit IRC (Quit: Konversation terminated!)
10:55 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:14 🔗 IAmbience has quit IRC (Quit: Connection closed for inactivity)
11:26 🔗 Tenebrae has quit IRC (Read error: Operation timed out)
11:41 🔗 Tenebrae has joined #archiveteam-ot
12:06 🔗 tuluu_ has joined #archiveteam-ot
12:06 🔗 tuluu has quit IRC (Read error: Connection reset by peer)
12:18 🔗 martini has joined #archiveteam-ot
12:45 🔗 IAmbience has joined #archiveteam-ot
12:53 🔗 hata has joined #archiveteam-ot
13:21 🔗 deevious has quit IRC (Read error: Connection reset by peer)
13:22 🔗 deevious has joined #archiveteam-ot
13:28 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
14:31 🔗 systwi_ is now known as systwi
14:44 🔗 deevious has joined #archiveteam-ot
15:14 🔗 manjaro-u has joined #archiveteam-ot
15:54 🔗 martini has quit IRC (Read error: Connection reset by peer)
15:55 🔗 martini has joined #archiveteam-ot
16:05 🔗 yano neat, https://archivebox.io/
16:30 🔗 X-Scale has quit IRC (Ping timeout: 252 seconds)
16:31 🔗 [X-Scale] has joined #archiveteam-ot
16:31 🔗 [X-Scale] is now known as X-Scale
16:32 🔗 deevious has quit IRC (Ping timeout: 252 seconds)
16:36 🔗 manjaro-u has quit IRC (Konversation terminated!)
16:42 🔗 Video has joined #archiveteam-ot
16:43 🔗 Video i'm guessing i can use this channel to talk about general archiving
16:43 🔗 astrid yep
16:44 🔗 Video perfecrt
16:44 🔗 Video perfect
16:45 🔗 Video i'm going to leave this here (as i'm archiving microsoft's support pages api): microsoft's support pages use an api that responds with JSON. if something isn't found, it just returns a 404 Not Found HTTP status. it blocks wget's default user agent with a 405 Method Not Allowed status (from what i recall).
16:45 🔗 Video here's a template url: https://support.microsoft.com/app/content/api/content/help/<localization>/<id>
16:45 🔗 Video ids in urls are numerically incremented and require a valid localization to work. an example of a url would be https://support.microsoft.com/app/content/api/content/help/en-us/931699
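A minimal sketch of how the template above could be filled in and requested with a non-default user agent, given that wget's default one is reportedly rejected. The function names and the user agent string are hypothetical; only the URL template and the example ID come from the messages above.

```python
import urllib.request

# URL template as given in the channel; the placeholders are filled per request.
API_TEMPLATE = (
    "https://support.microsoft.com/app/content/api/content/help/"
    "{localization}/{article_id}"
)

# Assumption: any non-wget user agent is accepted; this string is made up.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; example-archiver)"


def build_url(localization, article_id):
    """Fill the template with a localization (e.g. 'en-us') and a numeric ID."""
    return API_TEMPLATE.format(localization=localization, article_id=article_id)


def build_request(localization, article_id, user_agent=EXAMPLE_USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(
        build_url(localization, article_id),
        headers={"User-Agent": user_agent},
    )
```

Passing the result to `urllib.request.urlopen` would perform the fetch; a missing article should then surface as an `urllib.error.HTTPError` with code 404, going by the behaviour described above.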
16:46 🔗 yano got it; so for archive.today, if anyone cares, `archive.fo` is going away; coincidentally, if you use another domain, you don't get the captcha
16:47 🔗 JAA Video: Right, I noticed that as well when I briefly looked into it yesterday. Any idea how high the IDs go?
16:47 🔗 manjaro-u has joined #archiveteam-ot
16:47 🔗 Video I don't; however, I'm currently testing those waters myself
16:47 🔗 JAA I also noticed that some help articles use UUIDs instead of numerical IDs.
16:48 🔗 JAA Specifically, newer ones it seems.
16:49 🔗 Video can you give an example
16:49 🔗 JAA Yeah, a lot of Office stuff for example: https://support.office.com/en-us/article/download-and-install-or-reinstall-office-365-or-office-2019-on-a-pc-or-mac-4414eaaf-0478-48be-9c42-23adc4716658?ui=en-US&rs=en-US&ad=US
16:49 🔗 JAA It doesn't even query the API it seems.
16:51 🔗 Video oh god
16:51 🔗 Video it even includes the uuid in the name
16:54 🔗 Video I can't even find it in Microsoft's RSS feeds for office 365
16:57 🔗 Video i did find some feeds for office XP though
16:57 🔗 Video https://support.microsoft.com/app/content/api/content/feeds/sap/en-us/ac9e378c-db19-ecd1-6fdd-b94a2fae7264/rss
17:00 🔗 Video the highest ID i've gotten so far is 20191008
17:04 🔗 Video but the amount could get bigger over time
17:07 🔗 Video i've got 4 dedicated instances of wget running for 4 different ranges
17:07 🔗 Video and i've gotten about 100 MB worth of data so far
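Splitting a numeric ID space across several workers, as with the four wget instances above, can be sketched as follows. The helper `split_range` is hypothetical, and starting at ID 1 is an assumption; only the upper bound 20191008 is taken from the log.

```python
def split_range(first, last, parts):
    """Split the inclusive range [first, last] into contiguous sub-ranges.

    Returns a list of (start, end) tuples, one per worker. Any remainder
    is spread over the first few ranges so sizes differ by at most one.
    """
    total = last - first + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = first
    for index in range(parts):
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue  # more workers than IDs: skip empty ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# One range per downloader instance; lower bound 1 is assumed.
worker_ranges = split_range(1, 20191008, 4)
```

Each tuple can then be handed to a separate downloader process, so no ID is fetched twice and none is skipped.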
17:09 🔗 Video https://video.doesnt-have-a.life/oCQsyrgbdqKP.png
17:10 🔗 schbirid has joined #archiveteam-ot
17:15 🔗 manjaro-u has quit IRC (Konversation terminated!)
17:20 🔗 martini2 has joined #archiveteam-ot
17:25 🔗 martini has quit IRC (Read error: Operation timed out)
17:37 🔗 akierig has joined #archiveteam-ot
18:11 🔗 tuluu_ has quit IRC (Read error: Connection refused)
18:12 🔗 tuluu has joined #archiveteam-ot
18:15 🔗 bluefoo has quit IRC (Ping timeout: 255 seconds)
18:18 🔗 lunik1 has quit IRC (Read error: Operation timed out)
18:21 🔗 lunik1 has joined #archiveteam-ot
18:23 🔗 Video has quit IRC (Quit: Page closed)
18:25 🔗 manjaro-u has joined #archiveteam-ot
18:39 🔗 DogsRNice has joined #archiveteam-ot
19:23 🔗 akierig has quit IRC (Quit: later_gator)
19:31 🔗 bluefoo has joined #archiveteam-ot
19:45 🔗 Raccoon JAA: hard drive question; in your opinion, what is a good / appropriate "allocation unit" size for the 5TB Western Digital 2.5" external USB HDD, with respect to closely matching the hardware characteristics of physical writes, reads, caching, etc?
19:46 🔗 Raccoon If I had a good modern machine, I would just benchmark at different sizes
19:47 🔗 Raccoon (suss out which settings cause twice the physical labor)
21:36 🔗 BlueMax has joined #archiveteam-ot
21:37 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
22:09 🔗 martini2 has quit IRC (Quit: No Reasson)
22:17 🔗 schbirid a what?
22:18 🔗 schbirid has quit IRC (Quit: Leaving)
22:27 🔗 lunik1 has quit IRC (Read error: Connection reset by peer)
22:27 🔗 lunik1 has joined #archiveteam-ot
23:03 🔗 lunik1 has quit IRC (Read error: Connection reset by peer)
23:37 🔗 dd33cc has joined #archiveteam-ot
