Time |
Nickname |
Message |
00:03
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
00:10
🔗
|
|
apache2 has joined #archiveteam-ot |
00:37
🔗
|
|
robogoat_ has quit IRC (Ping timeout: 258 seconds) |
00:37
🔗
|
|
robogoat has joined #archiveteam-ot |
01:03
🔗
|
ivan |
Mateon1: I don't get it, why does node need that much stuff in memory for crawling annotations :-) |
01:04
🔗
|
ivan |
how was it stored on disk? |
01:26
🔗
|
|
yawkat has quit IRC (Ping timeout: 246 seconds) |
01:35
🔗
|
|
yawkat has joined #archiveteam-ot |
02:30
🔗
|
Raccoon |
yano: This would interest you. It's about the relationship of public libraries and ebooks, and new lending restrictions on the horizon. https://www.eff.org/deeplinks/2019/11/publishers-should-be-making-e-book-licensing-better-not-worse |
02:31
🔗
|
Raccoon |
Publishers are trying to erase library expectations of 'first sale doctrine' behavior / use in the e-book realm. |
02:48
🔗
|
yano |
Raccoon: yeah, i saw that :-\ |
02:50
🔗
|
Raccoon |
Really waiting for them to overstep, like Disney ending licensing of old films from the Fox catalog, like The Day The Earth Stood Still. |
02:51
🔗
|
Raccoon |
That act should immediately convert the copyright to public domain |
02:53
🔗
|
|
ShellyRol has quit IRC (Read error: Connection reset by peer) |
02:55
🔗
|
|
ShellyRol has joined #archiveteam-ot |
03:14
🔗
|
|
kiskabak has quit IRC (Ping timeout (120 seconds)) |
03:15
🔗
|
|
kiskabak has joined #archiveteam-ot |
03:15
🔗
|
|
Fusl sets mode: +o kiskabak |
03:15
🔗
|
|
Fusl__ sets mode: +o kiskabak |
03:15
🔗
|
|
Fusl_ sets mode: +o kiskabak |
03:39
🔗
|
|
m007a83 has joined #archiveteam-ot |
03:46
🔗
|
|
manjaro-u has quit IRC (Read error: Operation timed out) |
03:53
🔗
|
|
BlueMax has joined #archiveteam-ot |
04:39
🔗
|
|
manjaro-u has joined #archiveteam-ot |
04:40
🔗
|
|
qw3rty has joined #archiveteam-ot |
04:49
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
04:51
🔗
|
|
manjaro-u has quit IRC (Quit: Konversation terminated!) |
05:38
🔗
|
|
manjaro-u has joined #archiveteam-ot |
05:50
🔗
|
|
nataraj has joined #archiveteam-ot |
05:57
🔗
|
|
manjaro-u has quit IRC (Quit: Konversation terminated!) |
06:00
🔗
|
|
nataraj has quit IRC (Read error: Operation timed out) |
06:05
🔗
|
|
nataraj has joined #archiveteam-ot |
06:17
🔗
|
Mateon1 |
ivan: Sorry, I missed that as I was asleep already. I need a fast way to check whether a page was visited before queuing that page. At first I just stored the video IDs, playlist IDs and user IDs as strings in a Set(), then when I ran into GC issues I decoded the string IDs into a 64-bit integer, and stored that. |
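The decoding step Mateon1 describes can be sketched roughly as follows. This is a minimal illustration, not Mateon1's actual code (which ran in node): it assumes the video IDs are 11-character base64url strings that encode exactly 8 bytes, and the function names `video_id_to_int` and `int_to_video_id` are hypothetical. Playlist and user IDs have other lengths and are not handled here.

```python
import base64


def video_id_to_int(video_id: str) -> int:
    """Decode an 11-character base64url ID into an unsigned 64-bit integer.

    Assumption: the ID encodes exactly 8 bytes, so one '=' of padding
    makes it a valid 12-character base64 string.
    """
    if len(video_id) != 11:
        raise ValueError("expected an 11-character ID")
    raw = base64.urlsafe_b64decode(video_id + "=")
    return int.from_bytes(raw, "big")


def int_to_video_id(value: int) -> str:
    """Inverse of video_id_to_int: encode 8 bytes back to the string form."""
    encoded = base64.urlsafe_b64encode(value.to_bytes(8, "big"))
    return encoded.decode("ascii").rstrip("=")


# Checking membership before queuing a page.
visited = set()


def should_queue(video_id: str) -> bool:
    """Return True the first time an ID is seen, False afterwards."""
    key = video_id_to_int(video_id)
    if key in visited:
        return False
    visited.add(key)
    return True
```

Storing fixed-width integers instead of strings gives the runtime fewer heap objects to track, which is presumably why it eased the GC pressure mentioned above.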
06:19
🔗
|
Mateon1 |
For the annotation crawler most of this was done server side, and there was only a small set of cached IDs not to recrawl. With this one, there is no server, I'm just dumping lines of text into a TSV file in append mode for later processing. |
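A rough sketch of the "dump lines into a TSV file in append mode" approach, assuming one record per line. The helper name `append_row` and the column layout are hypothetical, since the log does not say which fields were written.

```python
import csv


def append_row(path, fields):
    """Append one tab-separated record to the file at `path`.

    Opening in "a" mode creates the file if needed and never truncates it,
    so results from earlier runs are kept.
    """
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(fields)
```

The `csv` writer quotes any field that itself contains a tab or newline, which keeps one record per line for later processing.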
06:19
🔗
|
|
manjaro-u has joined #archiveteam-ot |
06:20
🔗
|
ivan |
Mateon1: ah, I guess rocksdb might be a better place to store such a set |
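The idea of moving the visited set out of process memory and into an on-disk key-value store could look roughly like this. Note the substitution: this sketch uses Python's standard-library `dbm` module as a stand-in, not RocksDB, and the class name `DiskSet` is hypothetical. It only illustrates the access pattern (check, then insert) that such a store would serve.

```python
import dbm


class DiskSet:
    """A set of string keys kept in an on-disk key-value file.

    Stand-in for a RocksDB-backed set: uses the stdlib `dbm` module.
    """

    def __init__(self, path):
        # "c" opens the database read-write and creates it if missing.
        self._db = dbm.open(path, "c")

    def add(self, key: str) -> bool:
        """Insert `key`; return True if it was not present before."""
        encoded = key.encode("utf-8")
        if encoded in self._db:
            return False
        self._db[encoded] = b"1"  # the value is a placeholder; only keys matter
        return True

    def __contains__(self, key: str) -> bool:
        return key.encode("utf-8") in self._db

    def close(self):
        self._db.close()
```

Because the keys live on disk, the set survives restarts and does not grow the process heap, at the cost of slower lookups than an in-memory set.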
06:24
🔗
|
|
nataraj has quit IRC (Quit: Konversation terminated!) |
06:32
🔗
|
Mateon1 |
I just took a look, but I have no idea how to use that from an application, the C++ bindings are quite a mess, and I'd prefer to avoid Java if possible. I need to rethink the problem... and I'll probably end up reinventing the database, or something |
06:32
🔗
|
Ryz |
I just attempted to replicate the thing that Raccoon reported, just to make sure it's not only 'em; yep, with uBlock being active, I was unable to download https://cdn4.vectorstock.com/i/1000x1000/00/93/of-thief-vector-23180093.jpg (but I can still download it even with that supposed block by dragging the image from the window onto the desktop), |
06:32
🔗
|
Ryz |
With uBlock disabled, I was able to download the image normally |
06:32
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
06:34
🔗
|
Raccoon |
Neat, I didn't know Chrome did drag-drop to the desktop. |
06:37
🔗
|
Ryz |
Yeah, I usually save images that way because I'm too lazy to right-click the image, click "Save image as...", and then either browse for a folder to save the image in, or just skip that, leave it in the default download location, and confirm with "Save" |
06:39
🔗
|
Ryz |
Speaking of drag and drop, there's one thing Firefox has that Google Chrome appears never to have implemented (without an extension such as Tab-Snap), most likely due to what's basically 'dummy-proofing' in the eyes of the developers: being able to open multiple links by dragging text onto the window |
06:40
🔗
|
Ryz |
I can drag one link onto Google Chrome, but not multiple at all |
06:50
🔗
|
|
kiska has quit IRC (Remote host closed the connection) |
06:50
🔗
|
|
Flashfire has quit IRC (Remote host closed the connection) |
06:51
🔗
|
|
kiska has joined #archiveteam-ot |
06:51
🔗
|
|
Flashfire has joined #archiveteam-ot |
06:51
🔗
|
|
Fusl__ sets mode: +o kiska |
06:51
🔗
|
|
Fusl sets mode: +o kiska |
06:51
🔗
|
|
Fusl_ sets mode: +o kiska |
07:42
🔗
|
ivan |
Mateon1: people have bindings to it for node of course |
07:42
🔗
|
ivan |
and Rust, and a bunch of others |
07:56
🔗
|
ivan |
compared to using postgresql you get compression and more transactions per second while losing multi-user and interesting querying capabilities, so it can make sense for applications where you have a single process touching it |
08:11
🔗
|
Mateon1 |
I'll look into it, thanks for the suggestion. I'll have free time for experimenting on the weekend. |
09:10
🔗
|
|
Raccoon has quit IRC (Ping timeout: 612 seconds) |
09:17
🔗
|
|
lunik1 has quit IRC (Read error: Connection reset by peer) |
09:17
🔗
|
|
lunik1 has joined #archiveteam-ot |
09:25
🔗
|
|
Raccoon has joined #archiveteam-ot |
09:48
🔗
|
|
manjaro-u has quit IRC (Quit: Konversation terminated!) |
10:00
🔗
|
|
manjaro-u has joined #archiveteam-ot |
10:06
🔗
|
|
HP_Archiv has quit IRC (Ping timeout: 263 seconds) |
10:14
🔗
|
|
manjaro-u has quit IRC (Quit: Konversation terminated!) |
10:55
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
11:14
🔗
|
|
IAmbience has quit IRC (Quit: Connection closed for inactivity) |
11:26
🔗
|
|
Tenebrae has quit IRC (Read error: Operation timed out) |
11:41
🔗
|
|
Tenebrae has joined #archiveteam-ot |
12:06
🔗
|
|
tuluu_ has joined #archiveteam-ot |
12:06
🔗
|
|
tuluu has quit IRC (Read error: Connection reset by peer) |
12:18
🔗
|
|
martini has joined #archiveteam-ot |
12:45
🔗
|
|
IAmbience has joined #archiveteam-ot |
12:53
🔗
|
|
hata has joined #archiveteam-ot |
13:21
🔗
|
|
deevious has quit IRC (Read error: Connection reset by peer) |
13:22
🔗
|
|
deevious has joined #archiveteam-ot |
13:28
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
14:31
🔗
|
|
systwi_ is now known as systwi |
14:44
🔗
|
|
deevious has joined #archiveteam-ot |
15:14
🔗
|
|
manjaro-u has joined #archiveteam-ot |
15:54
🔗
|
|
martini has quit IRC (Read error: Connection reset by peer) |
15:55
🔗
|
|
martini has joined #archiveteam-ot |
16:05
🔗
|
yano |
neat, https://archivebox.io/ |
16:30
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
16:31
🔗
|
|
[X-Scale] has joined #archiveteam-ot |
16:31
🔗
|
|
[X-Scale] is now known as X-Scale |
16:32
🔗
|
|
deevious has quit IRC (Ping timeout: 252 seconds) |
16:36
🔗
|
|
manjaro-u has quit IRC (Konversation terminated!) |
16:42
🔗
|
|
Video has joined #archiveteam-ot |
16:43
🔗
|
Video |
i'm guessing i can use this channel to talk about general archiving |
16:43
🔗
|
astrid |
yep |
16:44
🔗
|
Video |
perfecrt |
16:44
🔗
|
Video |
perfect |
16:45
🔗
|
Video |
i'm going to leave this here (as i'm archiving microsoft's support pages api): microsoft's support pages use an api that responds with JSON. if something isn't found, it just returns an HTTP 404 Not Found status. it blocks wget's default user agent with a 405 Method Not Allowed status (from what i recall). |
16:45
🔗
|
Video |
here's a template url: https://support.microsoft.com/app/content/api/content/help/<localization>/<id> |
16:45
🔗
|
Video |
ids in urls are numerically incremented and require a valid localization to work. an example of a url would be https://support.microsoft.com/app/content/api/content/help/en-us/931699 |
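A minimal sketch of how the URL template above might be used, based only on what Video reports: JSON on success, 404 for missing IDs, and a rejected default client user agent. The `User-Agent` string, the function names, and the UTF-8 decoding are assumptions. Whether this particular user agent is accepted by the server is untested.

```python
import json
import urllib.error
import urllib.request

# Template taken from the chat; the placeholder names are illustrative.
API_TEMPLATE = (
    "https://support.microsoft.com/app/content/api/content/help/"
    "{localization}/{article_id}"
)


def build_url(localization: str, article_id: int) -> str:
    """Fill the template with a localization such as "en-us" and a numeric ID."""
    return API_TEMPLATE.format(localization=localization, article_id=article_id)


def fetch_article(localization, article_id,
                  user_agent="Mozilla/5.0 (compatible; example-archiver)"):
    """Fetch one article as parsed JSON, or return None on HTTP 404.

    The user agent is a hypothetical value chosen to avoid sending a
    default client string; other HTTP errors (e.g. 405) are re-raised.
    """
    request = urllib.request.Request(
        build_url(localization, article_id),
        headers={"User-Agent": user_agent},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # ID not assigned for this localization
        raise
```

Calling `build_url("en-us", 931699)` reproduces the example URL given in the message above.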
16:46
🔗
|
yano |
got it; so for archive.today, if anyone cares, `archive.fo` is going away; coincidentally, if you use another domain, you don't get the captcha |
16:47
🔗
|
JAA |
Video: Right, I noticed that as well when I briefly looked into it yesterday. Any idea how high the IDs go? |
16:47
🔗
|
|
manjaro-u has joined #archiveteam-ot |
16:47
🔗
|
Video |
I don't; however, I'm currently testing those waters myself |
16:47
🔗
|
JAA |
I also noticed that some help articles use UUIDs instead of numerical IDs. |
16:48
🔗
|
JAA |
Specifically, newer ones it seems. |
16:49
🔗
|
Video |
can you give an example |
16:49
🔗
|
JAA |
Yeah, a lot of Office stuff for example: https://support.office.com/en-us/article/download-and-install-or-reinstall-office-365-or-office-2019-on-a-pc-or-mac-4414eaaf-0478-48be-9c42-23adc4716658?ui=en-US&rs=en-US&ad=US |
16:49
🔗
|
JAA |
It doesn't even query the API it seems. |
16:51
🔗
|
Video |
oh god |
16:51
🔗
|
Video |
it even includes the uuid in the name |
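Since the UUID sits at the end of the article slug, it can be pulled out of such URLs with a pattern match. This is a sketch under the assumption that the UUID is always the last 36 characters of the URL path, as in JAA's example. The function name `article_uuid` is hypothetical.

```python
import re
import uuid
from urllib.parse import urlsplit

# Canonical 8-4-4-4-12 hex layout, anchored to the end of the path.
UUID_TAIL = re.compile(
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)


def article_uuid(url: str):
    """Return the trailing UUID of an article URL, or None if there is none.

    Only the path is inspected, so query strings like ?ui=en-US are ignored.
    """
    match = UUID_TAIL.search(urlsplit(url).path)
    return uuid.UUID(match.group(1)) if match else None
```

URLs that use numeric IDs simply yield `None`, so the same helper can be used to tell the two article styles apart.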
16:54
🔗
|
Video |
I can't even find it in Microsoft's RSS feeds for office 365 |
16:57
🔗
|
Video |
i did find some feeds for office XP though |
16:57
🔗
|
Video |
https://support.microsoft.com/app/content/api/content/feeds/sap/en-us/ac9e378c-db19-ecd1-6fdd-b94a2fae7264/rss |
17:00
🔗
|
Video |
the highest ID i've gotten so far is 20191008 |
17:04
🔗
|
Video |
but the amount could get bigger over time |
17:07
🔗
|
Video |
i've got 4 dedicated instances of wget running for 4 different ranges |
17:07
🔗
|
Video |
and i've gotten about 100 MB worth of data so far |
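Splitting an ID space across several workers, as with the four wget instances above, can be sketched like this. The actual ranges Video used are not stated in the log. The helper name `split_range` is hypothetical and the bounds in the usage note are only illustrative.

```python
def split_range(start: int, stop: int, parts: int):
    """Split the half-open interval [start, stop) into `parts` contiguous ranges.

    Sizes differ by at most one; any remainder goes to the earliest ranges.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(stop - start, parts)
    ranges = []
    low = start
    for index in range(parts):
        high = low + base + (1 if index < extra else 0)
        ranges.append(range(low, high))
        low = high
    return ranges
```

For example, `split_range(1, 20191009, 4)` would cover every ID up to the highest one reported above (20191008), with one range per worker.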
17:09
🔗
|
Video |
https://video.doesnt-have-a.life/oCQsyrgbdqKP.png |
17:10
🔗
|
|
schbirid has joined #archiveteam-ot |
17:15
🔗
|
|
manjaro-u has quit IRC (Konversation terminated!) |
17:20
🔗
|
|
martini2 has joined #archiveteam-ot |
17:25
🔗
|
|
martini has quit IRC (Read error: Operation timed out) |
17:37
🔗
|
|
akierig has joined #archiveteam-ot |
18:11
🔗
|
|
tuluu_ has quit IRC (Read error: Connection refused) |
18:12
🔗
|
|
tuluu has joined #archiveteam-ot |
18:15
🔗
|
|
bluefoo has quit IRC (Ping timeout: 255 seconds) |
18:18
🔗
|
|
lunik1 has quit IRC (Read error: Operation timed out) |
18:21
🔗
|
|
lunik1 has joined #archiveteam-ot |
18:23
🔗
|
|
Video has quit IRC (Quit: Page closed) |
18:25
🔗
|
|
manjaro-u has joined #archiveteam-ot |
18:39
🔗
|
|
DogsRNice has joined #archiveteam-ot |
19:23
🔗
|
|
akierig has quit IRC (Quit: later_gator) |
19:31
🔗
|
|
bluefoo has joined #archiveteam-ot |
19:45
🔗
|
Raccoon |
JAA: hard drive question; in your opinion, what is a good / appropriate "allocation unit" size for the 5TB Western Digital 2.5" external USB HDD, with respect to closely matching the hardware characteristics of physical writes, reads, caching, etc.? |
19:46
🔗
|
Raccoon |
If I had a good modern machine, I would just benchmark at different sizes |
19:47
🔗
|
Raccoon |
(suss out which settings cause twice the physical labor) |
21:36
🔗
|
|
BlueMax has joined #archiveteam-ot |
21:37
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
22:09
🔗
|
|
martini2 has quit IRC (Quit: No Reasson) |
22:17
🔗
|
schbirid |
a what? |
22:18
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:27
🔗
|
|
lunik1 has quit IRC (Read error: Connection reset by peer) |
22:27
🔗
|
|
lunik1 has joined #archiveteam-ot |
23:03
🔗
|
|
lunik1 has quit IRC (Read error: Connection reset by peer) |
23:37
🔗
|
|
dd33cc has joined #archiveteam-ot |